I am submitting the training through a script file. Following is the content of the train.py script. Azure ML is treating all these as one run (instead of run per alpha value as coded below) as Run.get_context() is returning the same Run id.
train.py
from azureml.opendatasets import Diabetes
from azureml.core import Run
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import math
import os
import logging
# Load dataset
dataset = Diabetes.get_tabular_dataset()
print(dataset.take(1))
df = dataset.to_pandas_dataframe()
df.describe()
# Split X (independent variables) & Y (target variable)
x_df = df.dropna() # Remove rows that have missing values
y_df = x_df.pop("Y") # Y is the label/target variable
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=66)
print('Original dataset size:', df.size)
print("Size after dropping 'na':", x_df.size)
print("Training split size: ", x_train.size)
print("Test split size: ", x_test.size)
# Training
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
# Create and log interactive runs
output_dir = os.path.join(os.getcwd(), 'outputs')
for hyperparam_alpha in alphas:
    # Get the experiment run context
    run = Run.get_context()
    print("Started run: ", run.id)
    run.log("train_split_size", x_train.size)
    run.log("test_split_size", x_test.size)
    run.log("alpha_value", hyperparam_alpha)
    # Train
    print("Train ...")
    model = Ridge(hyperparam_alpha)
    model.fit(X = x_train, y = y_train)
    # Predict
    print("Predict ...")
    y_pred = model.predict(X = x_test)
    # Calculate & log error
    rmse = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
    run.log("rmse", rmse)
    print("rmse", rmse)
    # Serialize the model to local directory
    if not os.path.isdir(output_dir):
        os.makedirs(output_dir, exist_ok=True)
    print("Save model ...")
    model_name = "model_alpha_" + str(hyperparam_alpha) + ".pkl" # Pickle file
    file_path = os.path.join(output_dir, model_name)
    joblib.dump(value = model, filename = file_path)
    # Upload the model
    run.upload_file(name = model_name, path_or_stream = file_path)
    # Complete the run
    run.complete()
Experiments view
Authoring code (i.e. control plane)
import os
from azureml.core import Workspace, Experiment, RunConfiguration, ScriptRunConfig, VERSION, Run
ws = Workspace.from_config()
exp = Experiment(workspace = ws, name = "diabetes-local-script-file")
# Create new run config obj
run_local_config = RunConfiguration()
# This means that when we run locally, all dependencies are already provided.
run_local_config.environment.python.user_managed_dependencies = True
# Create new script config
script_run_cfg = ScriptRunConfig(
    source_directory = os.path.join(os.getcwd(), 'code'),
    script = 'train.py',
    run_config = run_local_config)
run = exp.submit(script_run_cfg)
run.wait_for_completion(show_output=True)
Short Answer
Option 1: create child runs within run
run = Run.get_context() assigns the run object of the run that you're currently in to run. So in every iteration of the hyperparameter search, you're logging to the same run. To solve this, you need to create child (or sub-) runs for each hyperparameter value. You can do this with run.child_run(). Below is the template for making this happen.
run = Run.get_context()
for hyperparam_alpha in alphas:
    # Create a child run for this hyperparameter value
    run_child = run.child_run()
    print("Started run: ", run_child.id)
    run_child.log("train_split_size", x_train.size)
On the diabetes-local-script-file Experiment page, you can see that Run 9 was the parent run and Runs 10-19 were the child runs once you check "Include child runs". There is also a "Child runs" tab on the Run 9 details page.
Long answer
I highly recommend abstracting the hyperparameter search away from the data plane (i.e. train.py) and into the control plane (i.e. "authoring code"). This becomes especially valuable as training time increases and you can arbitrarily parallelize and also choose Hyperparameters more intelligently by using Azure ML's Hyperdrive.
Option 2 Create runs from control plane
Remove the loop from your code, add the code like below (full data and control here)
import argparse
from pprint import pprint
parser = argparse.ArgumentParser()
parser.add_argument('--alpha', type=float, default=0.5)
args = parser.parse_args()
print("all args:")
pprint(vars(args))
# use the variable like this
model = Ridge(args.alpha)
Below is how to submit runs using a script argument. To submit multiple runs, just use a loop (here, a list comprehension) in the control plane.
alphas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # Define hyperparameters
list_rcs = [ScriptRunConfig(
    source_directory = os.path.join(os.getcwd(), 'code'),
    script = 'train.py',
    arguments=['--alpha', a],
    run_config = run_local_config) for a in alphas]
list_runs = [exp.submit(rc) for rc in list_rcs]
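To see how the `--alpha` value arrives inside the script, here is a minimal, self-contained sketch of the parsing side. The helper name `parse_alpha` is hypothetical, and the argument list is passed in explicitly only so the snippet runs without a command line; it mirrors the argparse snippet above rather than anything Azure ML does internally.

```python
import argparse


def parse_alpha(argv):
    """Parse the --alpha script argument; defaults to 0.5 as in the snippet above."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--alpha', type=float, default=0.5)
    return parser.parse_args(argv).alpha


# Each submitted run receives its own argument list, e.g. ['--alpha', '0.3'].
alphas = [0.1, 0.2, 0.3]
parsed = [parse_alpha(['--alpha', str(a)]) for a in alphas]
print(parsed)
```

Because every run gets exactly one alpha, train.py no longer needs a loop, and each submission shows up as its own run.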
Option 3 Hyperdrive (IMHO the recommended approach)
This way you outsource the hyperparameter search to Hyperdrive. The UI will also report results exactly how you want them, and via the API you can easily download the best model. Note that you can no longer run this locally and must use AMLCompute, but to me it is a worthwhile trade-off. This is a great overview. Excerpt below (full code here)
param_sampling = GridParameterSampling({
    "alpha": choice(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
})
estimator = Estimator(
    source_directory = os.path.join(os.getcwd(), 'code'),
    entry_script = 'train.py',
    compute_target=cpu_cluster,
    environment_definition=Environment.get(workspace=ws, name="AzureML-Tutorial")
)
hyperdrive_run_config = HyperDriveConfig(estimator=estimator,
                                         hyperparameter_sampling=param_sampling,
                                         policy=None,
                                         primary_metric_name="rmse",
                                         # rmse is an error, so lower is better
                                         primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
                                         max_total_runs=10,
                                         max_concurrent_runs=4)
run = exp.submit(hyperdrive_run_config)
run.wait_for_completion(show_output=True)
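Conceptually, grid sampling tries every combination of the listed choices, and the primary metric then decides which run counts as best. The sketch below imitates that idea in plain Python. It is not Hyperdrive's implementation, and `fake_rmse` is a made-up stand-in for a real training run.

```python
from itertools import product

# Search space: one list of candidate values per hyperparameter.
space = {"alpha": [0.1, 0.5, 1.0]}


def fake_rmse(alpha):
    """Hypothetical stand-in for training a model and returning its RMSE."""
    return (alpha - 0.5) ** 2 + 1.0


names = list(space)
# Every combination of choices is one configuration, i.e. one run.
configs = [dict(zip(names, values)) for values in product(*space.values())]
results = [(fake_rmse(**cfg), cfg) for cfg in configs]

# RMSE is an error, so the best run is the one with the smallest value.
best_rmse, best_cfg = min(results, key=lambda item: item[0])
print(best_cfg, best_rmse)
```

With more than one hyperparameter in `space`, `product` enumerates the full grid, which is why the number of runs grows multiplicatively.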
Related
I am currently trying to optimize the hyperparameters of a gradient boosting method with the library hyperopt. When I was working on my own computer, I used the class Trials and I was able to save and reload my results with the pickle library. This let me keep a record of every set of parameters I tested. My code looked like this:
import os
import pickle as pkl
import xgboost as xgb
from hyperopt import Trials, SparkTrials, STATUS_OK, tpe, fmin
from LearningUtils.LearningUtils import build_train_test, get_train_test, mean_error, rmse, mae
from LearningUtils.constants import MAX_EVALS, CV, XGBOOST_OPTIM_SPACE, PARALELISM
from sklearn.model_selection import cross_val_score
if os.path.isfile(PATH_TO_TRIALS):  # we reload the past results
    with open(PATH_TO_TRIALS, 'rb') as trials_file:
        trials = pkl.load(trials_file)
else:  # We create the trials object
    trials = Trials()
# classic hyperparameters optimization
def objective(space):
    regressor = xgb.XGBRegressor(n_estimators = space['n_estimators'],
                                 max_depth = int(space['max_depth']),
                                 learning_rate = space['learning_rate'],
                                 gamma = space['gamma'],
                                 min_child_weight = space['min_child_weight'],
                                 subsample = space['subsample'],
                                 colsample_bytree = space['colsample_bytree'],
                                 verbosity=0
                                 )
    regressor.fit(X_train, Y_train)
    # Applying k-Fold Cross Validation (the keyword is X, not x)
    accuracies = cross_val_score(estimator=regressor, X=X_train, y=Y_train, cv=5)
    CrossValMean = accuracies.mean()
    return {'loss':1-CrossValMean, 'status': STATUS_OK}
best = fmin(fn=objective,
            space=XGBOOST_OPTIM_SPACE,
            algo=tpe.suggest,
            max_evals=MAX_EVALS,
            trials=trials,
            return_argmin=False)
# Save the trials
with open(PATH_TO_TRIALS, "wb") as trials_file:
    pkl.dump(trials, trials_file)
Now, I would like to make this code work on a remote server with more CPUs, to allow parallelisation and save time.
I saw that I can simply do that using the SparkTrials class of hyperopt instead of Trials. But SparkTrials objects cannot be saved with pickle. Do you have any idea how I could save and reload the trial results stored in a SparkTrials object?
This might be a bit late, but after messing around a bit, I found a somewhat hacky solution:
import pickle
spark_trials = SparkTrials()
pickling_trials = dict()
for k, v in spark_trials.__dict__.items():
    if k not in ['_spark_context', '_spark']:
        pickling_trials[k] = v
with open('pickling_trials.hyperopt', 'wb') as f:
    pickle.dump(pickling_trials, f)
The _spark_context and _spark attributes of the SparkTrials instance are what prevent the object from being serialized. It turns out that you don't need them to re-use the object: if you re-run the optimization, a new Spark context is created anyway, so you can re-use the trials like this:
new_sparktrials = SparkTrials()
for att, v in pickling_trials.items():
    setattr(new_sparktrials, att, v)
best = fmin(loss_func,
            space=search_space,
            algo=tpe.suggest,
            max_evals=1000,
            trials=new_sparktrials)
voilà :)
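The same save-and-restore pattern can be tried without a Spark cluster. The sketch below uses a hypothetical `FakeTrials` class whose `_spark` attribute is a `threading.Lock` (which, like a Spark session, cannot be pickled); the class and attribute values are assumptions for illustration only, not hyperopt's actual internals.

```python
import pickle
import threading


class FakeTrials:
    """Hypothetical stand-in for SparkTrials: results plus an unpicklable handle."""

    def __init__(self):
        self.results = []
        self._spark = threading.Lock()  # locks cannot be pickled


UNPICKLABLE = {'_spark'}

trials = FakeTrials()
trials.results.append({'loss': 0.42, 'status': 'ok'})

# Keep every attribute except the unpicklable ones, then serialize the dict.
state = {k: v for k, v in trials.__dict__.items() if k not in UNPICKLABLE}
payload = pickle.dumps(state)

# Later: build a fresh object (with a fresh handle) and copy the saved state back.
restored = FakeTrials()
for name, value in pickle.loads(payload).items():
    setattr(restored, name, value)

print(restored.results)
```

Since this relies on private attribute names, it may need adjusting if a library version renames them.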
This works (mostly from the demo sample at sklearn):
print(__doc__)
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
lregress = LinearRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('regress', lregress)])
# Plot the PCA spectrum
pca.fit(data_num)
plt.figure(1, figsize=(16, 9))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = uniform.rvs(loc=1, scale=data_num.shape[1], size=50,
random_state=42).astype(int)
# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator_pca = GridSearchCV(pipe,
dict(pca__n_components=n_components)
)
estimator_pca.fit(data_num, data_labels)
plt.axvline(estimator_pca.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen ' +
str(estimator_pca.best_estimator_.named_steps['pca'].n_components))
plt.legend(prop=dict(size=12))
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=1)
plt.show()
And this works:
from sklearn.feature_selection import RFECV
estimator = LinearRegression()
selector = RFECV(estimator, step=1, cv=5, scoring='explained_variance')
selector = selector.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)
plt.show()
but this gives me the error "RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes" on the line "selector1 = selector1.fit"
pca_est = estimator_pca.best_estimator_
selector1 = RFECV(pca_est, step=1, cv=5, scoring='explained_variance')
selector1 = selector1.fit(data_num_pd, data_labels)
print("Selected number of features : %d" % selector1.n_features_)
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_)
plt.show()
How do I get my best-found PCA estimator to be used as the estimator in RFECV?
This is a known issue in pipeline design. Refer to the github page here:
Accessing fitted attributes:
Moreover, some fitted attributes are used by meta-estimators;
AdaBoostClassifier assumes its sub-estimator has a classes_ attribute
after fitting, which means that presently Pipeline cannot be used as
the sub-estimator of AdaBoostClassifier.
Either meta-estimators such as AdaBoostClassifier need to be
configurable in how they access this attribute, or meta-estimators
such as Pipeline need to make some fitted attributes of sub-estimators
accessible.
The same goes for other attributes like coef_ and feature_importances_. They belong to the last estimator, so the pipeline does not expose them.
Now you can try to follow the last para here and try to circumvent this to include it in pipeline, by doing something like this:
class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
Then use this new pipeline class in your code instead of the original Pipeline.
This should work in most cases, but not in yours. You are doing feature reduction with PCA inside the pipeline, but you also want to do feature selection with RFECV. In my opinion this is not a good combination.
RFECV keeps decreasing the number of features it uses, but the n_components of the best PCA selected by the grid search above is fixed. So it will throw an error again as soon as the number of features drops below n_components, and there is nothing you can do about that.
So I would advise you to rethink your use case and code.
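The delegation idea itself is easy to check in isolation. The sketch below uses toy classes (`FinalStep`, `MiniPipeline`, `MyMiniPipeline` are all hypothetical stand-ins, not scikit-learn classes) to show how a property on a subclass can expose an attribute of the last step.

```python
class FinalStep:
    """Hypothetical stand-in for a fitted final estimator."""

    def __init__(self):
        self.coef_ = [0.5, -1.2]


class MiniPipeline:
    """Toy pipeline: only keeps a list of (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    @property
    def _final_estimator(self):
        return self.steps[-1][1]


class MyMiniPipeline(MiniPipeline):
    # Expose the last step's fitted attribute on the pipeline itself.
    @property
    def coef_(self):
        return self._final_estimator.coef_


plain = MiniPipeline([('regress', FinalStep())])
wrapped = MyMiniPipeline([('regress', FinalStep())])
print(hasattr(plain, 'coef_'))   # the plain pipeline hides the attribute
print(wrapped.coef_)             # the subclass forwards it
```

A meta-estimator that looks up `coef_` on its sub-estimator would find it on the subclass but not on the plain pipeline.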
I used cross validation to train a linear regression model using the following code:
from pyspark.ml.evaluation import RegressionEvaluator
lr = LinearRegression(maxIter=maxIteration)
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(training)
now I want to draw the roc curve, I used the following code but I get this error:
'LinearRegressionTrainingSummary' object has no attribute 'areaUnderROC'
trainingSummary = cvModel.bestModel.stages[-1].summary
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
I also want to check the objectiveHistory at each iteration. I know that I can get it at the end:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
but I want to get it at each iteration, how can I do this?
Moreover I want to evaluate the model on the test data, how can I do that?
prediction = cvModel.transform(test)
I know for the training data set I can write:
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
but how can I get these metrics for testing data set?
1) The area under the ROC curve (AUC) is defined only for binary classification, hence you cannot use it for regression tasks, as you are trying to do here.
2) The objectiveHistory for each iteration is only available when the solver argument in the regression is l-bfgs (documentation); here is a toy example:
spark.version
# u'2.1.1'
from pyspark.ml import Pipeline
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.4),
(Vectors.dense([0.5]), 1.9),
(Vectors.dense([0.6]), 0.9),
(Vectors.dense([1.2]), 1.0)] * 10,
["features", "label"])
lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here
modelEvaluator=RegressionEvaluator()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).addGrid(lr.elasticNetParam, [0, 1]).build()
crossval = CrossValidator(estimator=lr,
estimatorParamMaps=paramGrid,
evaluator=modelEvaluator,
numFolds=3)
cvModel = crossval.fit(dataset)
trainingSummary = cvModel.bestModel.summary
trainingSummary.totalIterations
# 2
trainingSummary.objectiveHistory # one value for each iteration
# [0.49, 0.4511834723904831]
3) You have already defined a RegressionEvaluator which you can use for evaluating your test set but, if used without arguments, it assumes the RMSE metric; here is a way to define evaluators with different metrics and apply them to your test set (continuing the code from above):
test = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.2),
(Vectors.dense([0.4]), 1.1),
(Vectors.dense([0.5]), 0.9),
(Vectors.dense([0.6]), 1.0)],
["features", "label"])
modelEvaluator.evaluate(cvModel.transform(test)) # rmse by default, if not specified
# 0.35384585061028506
eval_rmse = RegressionEvaluator(metricName="rmse")
eval_r2 = RegressionEvaluator(metricName="r2")
eval_rmse.evaluate(cvModel.transform(test)) # same as above
# 0.35384585061028506
eval_r2.evaluate(cvModel.transform(test))
# -0.001655087952929124
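For reference, both metrics reduce to short formulas. The sketch below computes them in plain Python on made-up numbers (not the values from the Spark session above); the function names are my own.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.0, 3.0, 2.0]
print(rmse(labels, predictions))  # 1.0
print(r2(labels, predictions))    # approximately 0.2
```

Note that r2 becomes negative whenever the predictions are worse than always predicting the mean of the labels, which is why a slightly negative value like the one above is possible on a test set.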
I am learning Distributed Tensorflow, and I implemented a simple version code of In-graph replication as below (task_parallel.py):
import argparse
import logging
import tensorflow as tf
log = logging.getLogger(__name__)
# Job Names
PARAMETER_SERVER = "ps"
WORKER_SERVER = "worker"
# Cluster Details
CLUSTER_SPEC = {
PARAMETER_SERVER: ["localhost:2222"],
WORKER_SERVER: ["localhost:1111", "localhost:1112", "localhost:1113"]}
def parse_command_arguments():
    """ Set up and parse the command line arguments passed for experiment. """
    parser = argparse.ArgumentParser(
        description="Parameters and Arguments for the Test.")
    parser.add_argument(
        "--ps_hosts",
        type=str,
        default="",
        help="Comma-separated list of hostname:port pairs"
    )
    parser.add_argument(
        "--worker_hosts",
        type=str,
        default="",
        help="Comma-separated list of hostname:port pairs"
    )
    parser.add_argument(
        "--job_name",
        type=str,
        default="",
        help="One of 'ps', 'worker'"
    )
    # Flags for defining the tf.train.Server
    parser.add_argument(
        "--task_index",
        type=int,
        default=0,
        help="Index of task within the job"
    )
    return parser.parse_args()
def start_server(
        job_name, ps_hosts, task_index, worker_hosts):
    """ Create a server based on a cluster spec. """
    cluster_spec = {
        PARAMETER_SERVER: ps_hosts,
        WORKER_SERVER: worker_hosts}
    cluster = tf.train.ClusterSpec(cluster_spec)
    server = tf.train.Server(
        cluster, job_name=job_name, task_index=task_index)
    return server
def model():
    """ Build up a simple estimator model. """
    with tf.device("/job:%s/task:0" % PARAMETER_SERVER):
        log.info("111")
        # Build a linear model and predict values
        W = tf.Variable([.3], tf.float32)
        b = tf.Variable([-.3], tf.float32)
        x = tf.placeholder(tf.float32)
        linear_model = W * x + b
        y = tf.placeholder(tf.float32)
        global_step = tf.Variable(0)
    with tf.device("/job:%s/task:0" % WORKER_SERVER):
        # Loss sub-graph
        loss = tf.reduce_sum(tf.square(linear_model - y))
        log.info("222")
        # optimizer
        optimizer = tf.train.GradientDescentOptimizer(0.01)
    with tf.device("/job:%s/task:1" % WORKER_SERVER):
        log.info("333")
        train = optimizer.minimize(loss, global_step=global_step)
    return W, b, loss, x, y, train, global_step
def main():
    # Parse arguments from command line.
    arguments = parse_command_arguments()
    # Initializing logging with level "INFO".
    logging.basicConfig(level=logging.INFO)
    ps_hosts = arguments.ps_hosts.split(",")
    worker_hosts = arguments.worker_hosts.split(",")
    job_name = arguments.job_name
    task_index = arguments.task_index
    # Start a server.
    server = start_server(
        job_name, ps_hosts, task_index, worker_hosts)
    W, b, loss, x, y, train, global_step = model()
    # with sv.prepare_or_wait_for_session(server.target) as sess:
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(arguments.task_index == 0 and (
                arguments.job_name == 'ps')),
            config=tf.ConfigProto(log_device_placement=True)) as sess:
        step = 0
        # training data
        x_train = [1, 2, 3, 4]
        y_train = [0, -1, -2, -3]
        while not sess.should_stop() and step < 1000:
            _, step = sess.run(
                [train, global_step], {x: x_train, y: y_train})
        # evaluate training accuracy
        curr_W, curr_b, curr_loss = sess.run(
            [W, b, loss], {x: x_train, y: y_train})
        print("W: %s b: %s loss: %s" % (curr_W, curr_b, curr_loss))
if __name__ == "__main__":
    main()
I ran the code with 3 different processes in a single machine (MacPro with only CPU):
PS: $python task_parallel.py --task_index 0 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name ps,
Worker 1: $python task_parallel.py --task_index 0 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name worker
Worker 2: $python task_parallel.py --task_index 1 --ps_hosts localhost:2222 --worker_hosts localhost:1111,localhost:1112 --job_name worker
I noticed that the results were not what I expected. Specifically, I expected process "PS" to print only 111, "Worker 1" only 222 and "Worker 2" only 333, since I specified a task for each process. However, all 3 processes printed exactly the same thing:
INFO:__main__:111
INFO:__main__:222
INFO:__main__:333
Isn't it true that process PS only executes the code inside the block with tf.device("/job:%s/task:0" % PARAMETER_SERVER)? And the same for the workers? I wonder if I missed something in my code.
I also found that I had to start all worker processes first and the ps process afterwards. Otherwise, the worker processes could not exit gracefully after training was done. I would like to know what in my code causes this. I really appreciate any help :) Thanks!
Please note that, in your snippet, the code before MonitoredTrainingSession only describes and builds the graph; both the parameter server and the workers execute this code to generate the graph. A tf.device block only decides where the operations created inside it are placed; it does not decide which process runs the Python statements inside it. The graph is frozen when the MonitoredTrainingSession is created.
If you want to see 111 only in PS, your code may work like this:
FLAGS = tf.app.flags.FLAGS
if FLAGS.job_name == 'ps':
    print('111')
    server.join()
else:
    print('222')
If you want to setup replicas model in workers, in model() function:
with tf.device('/job:ps/task:0'):
    # define variable in parameter
with tf.device('/job:worker/task:%d' % FLAGS.task_index):
    # define model in worker % task_index
Additionally, replica_device_setter will automatically assign devices to Operation objects as they are constructed.
There are some examples provided by tensorflow, such as:
hello distributed, a basic guide in the tensorflow tutorial.
mnist_replica.py, a distributed MNIST training and validation, with model replicas.
cifar10_multi_gpu_train.py, a binary to train CIFAR-10 using multiple GPUs with synchronous updates.
I hope this helps.
I have created a Gaussian Naive Bayes classifier on an email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets and then calculated the accuracy, using the features that are present in the sklearn Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand
feature_data_file = os.path.join(path, "createdDataset", "dataSet.pkl")
label_data_file = os.path.join(path, "createdDataset", "dataLabel.pkl")
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
"""
Starter code to process the texts of accurate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are
### thousands of lines of accurate and inaccurate text, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at first 200 texts when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter = 1
        if temp_counter < 200:
            path = os.path.join('..', path[:-1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened text
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif(name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
I also want to know whether I can train the classifier incrementally, that is, retrain an existing model with newer data to refine it over time.
I would be really glad if someone could help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data (call transform, not fit_transform).
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s).
To respond to your final question about iteratively re-training your model: yes, you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.
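The four steps above can be sketched end to end as follows. This is a minimal illustration on a tiny made-up corpus, assuming scikit-learn is installed; the training texts, the labels and the percentile of 50 are assumptions chosen so the example runs quickly, not values from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB

# Tiny made-up corpus; "1" = spam, "0" = not spam.
train_texts = [
    "win a free lottery prize now",
    "claim your free prize money",
    "meeting agenda for monday morning",
    "please review the project report",
]
train_labels = ["1", "1", "0", "0"]

# Fit every preprocessing step on the training data only.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
features_train = vectorizer.fit_transform(train_texts)
selector = SelectPercentile(f_classif, percentile=50)
features_train = selector.fit_transform(features_train, train_labels).toarray()

clf = GaussianNB()
clf.fit(features_train, train_labels)

# New email: reuse the fitted vectorizer and selector (transform, not fit_transform).
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
new_features = selector.transform(vectorizer.transform([new_text])).toarray()
pred = clf.predict(new_features)
print(pred)
```

In the question's layout this means the fitted `vectorizer` and `selector` have to be kept (for example, returned from `preprocess()` or pickled alongside the classifier) so that they are available at prediction time.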