My Random Forest model code concludes with:
print('\nModel performance:')
performance = best_nn.model_performance(test_data = test)
accuracy = performance.accuracy()
precision = performance.precision()
F1 = performance.F1()
auc = performance.auc()
print(' accuracy.................', accuracy)
print(' precision................', precision)
print(' F1.......................', F1)
print(' auc......................', auc)
and this code produces the following output:
Model performance:
accuracy................. [[0.6622929108639558, 0.9078947368421053]]
precision................ [[0.6622929108639558, 1.0]]
F1....................... [[0.304835115538703, 0.5853658536585366]]
auc...................... 0.9103448275862068
Why am I getting two numbers for accuracy, precision and F1, and what do they mean?
Charles
PS: My environment is:
H2O cluster uptime: 6 mins 02 secs
H2O cluster version: 3.10.4.8
H2O cluster version age: 2 months and 9 days
H2O cluster name: H2O_from_python_Charles_wdmhb7
H2O cluster total nodes: 1
H2O cluster free memory: 21.31 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.2 final
The two numbers are the threshold and the value of that metric at that threshold, respectively. Once the threshold is fixed, the accuracy or precision can be calculated.
If you use model.confusion_matrix() you can see what threshold was used.
For example, in binary classification the "threshold" is the value (between 0 and 1) that determines the predicted class label. If your model predicts 0.2 for a particular test case and your threshold is 0.4, the predicted class label will be 0. If your threshold were 0.15, the predicted class label would be 1.
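As a rough illustration of the answer above, the sketch below unpacks two of the [[threshold, value]] pairs from the question's output and applies a threshold to a predicted probability. The helper label_for is hypothetical and not part of the H2O API, and the ">=" comparison is an assumption about how ties are handled.

```python
# Metric pairs copied from the output in the question: [[threshold, value]].
accuracy = [[0.6622929108639558, 0.9078947368421053]]
f1 = [[0.304835115538703, 0.5853658536585366]]


def label_for(probability, threshold):
    """Hypothetical helper: class 1 if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Each metric is a list holding one [threshold, value] pair.
acc_threshold, acc_value = accuracy[0]
f1_threshold, f1_value = f1[0]
print("accuracy %.3f at threshold %.3f" % (acc_value, acc_threshold))
print("F1 %.3f at threshold %.3f" % (f1_value, f1_threshold))

# The worked example from the answer: a predicted probability of 0.2.
print(label_for(0.2, 0.4))   # 0
print(label_for(0.2, 0.15))  # 1
```

Note that accuracy and F1 are reported at different thresholds, which is why the first number differs between the two rows.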
I am trying to run the optimization of a ML model using SparkTrials from the hyperopt library. I am running this on a single machine with 16 cores but when I run the following code which sets the number of cores to 8 I get a warning that seems to indicate that only one core is used.
SparkTrials accepts as an argument spark_session which in theory is where I set the number of cores.
Can anyone help me?
Thanks!
import os, shutil, tempfile
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
import numpy as np
from sklearn import linear_model, datasets, model_selection
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").config('spark.local.dir', './').config("spark.executor.cores", 8).getOrCreate()
def gen_data(bytes):
    """
    Generates train/test data with target total bytes for a random regression problem.
    Returns (X_train, X_test, y_train, y_test).
    """
    n_features = 100
    n_samples = int(1.0 * bytes / (n_features + 1) / 8)
    X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
    return model_selection.train_test_split(X, y, test_size=0.2, random_state=1)

def train_and_eval(data, alpha):
    """
    Trains a LASSO model using training data with the input alpha and evaluates it using test data.
    """
    X_train, X_test, y_train, y_test = data
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    loss = model.score(X_test, y_test)
    return {"loss": loss, "status": STATUS_OK}

def tune_alpha(objective):
    """
    Uses Hyperopt's SparkTrials to tune the input objective, which takes alpha as input and returns loss.
    Returns the best alpha found.
    """
    best = fmin(
        fn=objective,
        space=hp.uniform("alpha", 0.0, 10.0),
        algo=tpe.suggest,
        max_evals=8,
        trials=SparkTrials(parallelism=8, spark_session=spark))
    return best["alpha"]

data_small = gen_data(10 * 1024 * 1024)  # ~10MB

def objective_small(alpha):
    # For small data, you might reference it directly.
    return train_and_eval(data_small, alpha)

tune_alpha(objective_small)
Parallelism (8) is greater than the current total of Spark task slots
(1). If dynamic allocation is enabled, you might see more executors
allocated.
If you are on a cluster: a "core" in Spark nomenclature is unrelated to the physical cores of your CPU. With spark.executor.cores you specified that the maximum number of threads (= tasks) each executor can run is 8, and you have only one executor here. If you want to increase the number of executors, use --num-executors on the command line or the spark.executor.instances configuration property in your code.
I suggest trying something like this configuration if you are on a YARN cluster:
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","2")
spark.conf.set("spark.dynamicAllocation.maxExecutors","10")
Note that the options above are not available in local mode.
Local: in local mode you have only one executor. To change the number of its worker threads (one by default), set your master to local[*] or, for example, local[16].
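For the local-mode case, a minimal sketch of how the master string could be built and passed to the session builder is shown below. The helper names local_master and build_session are hypothetical; the builder calls mirror the ones already used in the question, with only the master string changed.

```python
def local_master(threads=None):
    """Build a local-mode master URL: local[*] uses all cores, local[N] uses N threads."""
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return "local[%d]" % threads


def build_session(threads=None):
    """Create a local SparkSession (hypothetical helper).

    pyspark is imported lazily so local_master() can be used without it.
    """
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master(local_master(threads))
            .config("spark.local.dir", "./")
            .getOrCreate())


print(local_master(8))  # local[8]
print(local_master())   # local[*]
```

With this, the question's call would become SparkTrials(parallelism=8, spark_session=build_session(8)). Because getOrCreate() returns an already running session if one exists, any session created earlier with master "local" should be stopped first, otherwise the new master setting is not picked up.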
Is this measure normalized between 0 and 1?
From https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html I understand that it is not normalized, but is that specific to scikit-learn or true in general?
The minimum value is 0 but the maximum can be above 1.
From the documentation:
This index signifies the average ‘similarity’ between clusters, where
the similarity is a measure that compares the distance between
clusters with the size of the clusters themselves.
Zero is the lowest possible score. Values closer to zero indicate a better partition.
Example where the score is > 1:
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
iris = datasets.load_iris()
X = iris.data
kmeans = KMeans(n_clusters=13, random_state=1).fit(X)
labels = kmeans.labels_
davies_bouldin_score(X, labels)
1.068885319440245
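To see why the score has no upper bound of 1, it may help to compute it by hand. The sketch below is a plain-Python version of the usual Davies-Bouldin definition (for each cluster, the worst ratio of summed within-cluster spreads to centroid distance, averaged over clusters), applied to two tiny made-up one-dimensional datasets. The function names and the data are illustrative assumptions, not scikit-learn code.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points (lists of floats)."""
    centers = [centroid(c) for c in clusters]
    # Spread of a cluster: mean Euclidean distance of its points to its centroid.
    spreads = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    total = 0.0
    for i in range(len(clusters)):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((spreads[i] + spreads[j]) / math.dist(centers[i], centers[j])
                     for j in range(len(clusters)) if j != i)
    return total / len(clusters)


well_separated = [[[0.0], [1.0]], [[10.0], [11.0]]]  # made-up data
overlapping = [[[0.0], [4.0]], [[1.0], [5.0]]]       # made-up data
print(davies_bouldin(well_separated))  # 0.1
print(davies_bouldin(overlapping))     # 4.0
```

The ratio grows whenever clusters are wide relative to the distance between their centroids, so overlapping clusters can push the score well above 1.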
I'm learning python and I want to perform a simple linear regression on a .csv dataset. I've successfully imported the data file. If I have data for 8 five year periods and I want to do simple linear regression how would I do this? The data is by county/state. So my headers are county, state, 1980,1985 etc.. Appreciate any help.
Please specify what target label you have in mind.
In any case, use the pandas and sklearn libraries.
import pandas as pd
val = pd.DataFrame(your.data)
val.columns = your.headers
Assuming you have a target column called "Price":
from sklearn.linear_model import LinearRegression
X = val.drop('Price', axis=1)
y = val['Price']
X contains the features on which the regression will be performed, and y contains the target.
Now create a LinearRegression object:
lm = LinearRegression()
Fit the model:
lm.fit(X, y)
Predict your targets:
lm.predict(X)
That's it.
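For the layout described in the question (one row per county, one column per five-year period), one possible reading is a separate simple regression of value against year for each row. The sketch below does this with the closed-form least-squares formulas in plain Python. The years 1980 to 2015, the county name and the values are all made up for illustration, since the question does not show the data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed column headers: eight five-year periods starting in 1980.
years = [1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015]

# Hypothetical rows keyed by (county, state); the values are made up.
rows = {
    ("Example County", "XX"): [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0],
}

for (county, state), values in rows.items():
    slope, intercept = fit_line(years, values)
    print(county, state, "slope per year:", slope, "intercept:", intercept)
```

The slope is the estimated change per year for that county. The same fit could be done with LinearRegression by passing the years as a single-column X and the row's values as y.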
Almost all real-world problems you will encounter have more than two variables, so let's skip the basic linear regression example. Linear regression involving multiple variables is called "multiple linear regression". The steps to perform multiple linear regression are almost the same as those for simple linear regression; the difference lies in the evaluation. You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
In this section we will use multiple linear regression to predict the gas consumption (in millions of gallons) in 48 US states based upon gas taxes (in cents), per capita income (dollars), paved highways (in miles) and the proportion of population that has a driver’s license.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
dataset = pd.read_csv('C:/your_Path_here/petrol_consumption.csv')
dataset.head()
dataset.describe()
Result:
Index ... Consumption_In_Millions
count 48.00 ... 48.000000
mean 24.50 ... 576.770833
std 14.00 ... 111.885816
min 1.00 ... 344.000000
25% 12.75 ... 509.500000
50% 24.50 ... 568.500000
75% 36.25 ... 632.750000
max 48.00 ... 968.000000
Preparing the Data
The next step is to divide the data into attributes and labels as we did previously. However, unlike last time, this time around we are going to use column names for creating an attribute set and label. Execute the following script:
X = dataset[['Petrol_Tax', 'Average_Income', 'Paved_Highways', 'ProportionWithLicenses']]
y = dataset['Consumption_In_Millions']
Execute the following code to divide our data into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Algorithm
And finally, to train the algorithm we execute the same code as before, using the fit() method of the LinearRegression class:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
In multiple linear regression, the model has to find the optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
Result:
Coefficient
Petrol_Tax -40.016660
Average_Income -0.065413
Paved_Highways -0.004741
ProportionWithLicenses 1341.862121
This means that for a unit increase in "Petrol_Tax", there is a decrease of 40.02 million gallons in gas consumption. Similarly, a unit increase in the proportion of the population with a driver’s license results in an increase of 1.342 billion gallons of gas consumption. We can see that "Average_Income" and "Paved_Highways" have very little effect on gas consumption.
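The reading of the coefficients can be checked with a little arithmetic. The sketch below copies the coefficients from the table above and computes the predicted change in consumption when one attribute changes and the others stay fixed. The helper name predicted_change is hypothetical.

```python
# Coefficients copied from the result table above (millions of gallons per unit).
coefficients = {
    "Petrol_Tax": -40.016660,
    "Average_Income": -0.065413,
    "Paved_Highways": -0.004741,
    "ProportionWithLicenses": 1341.862121,
}


def predicted_change(attribute, delta):
    """Change in predicted consumption when one attribute moves by delta."""
    return coefficients[attribute] * delta


# One extra cent of petrol tax.
print(round(predicted_change("Petrol_Tax", 1), 2))                 # -40.02
# A one percentage point rise (0.01) in the proportion with licenses.
print(round(predicted_change("ProportionWithLicenses", 0.01), 2))  # 13.42
```

Since the license attribute is a proportion, a "unit increase" would mean going from 0 to 1, so a step of 0.01 is the more realistic comparison.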
Making Predictions
To make predictions on the test data, execute the following script:
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Result:
Actual Predicted
29 534 469.391989
4 410 545.645464
26 577 589.668394
30 571 569.730413
32 577 649.774809
37 704 646.631164
34 487 511.608148
40 587 672.475177
7 467 502.074782
10 580 501.270734
Evaluating the Algorithm
The final step is to evaluate the performance of the algorithm. We'll do this by finding the values for MAE, MSE and RMSE. Execute the following script:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Results:
Mean Absolute Error: 56.822247479
Mean Squared Error: 4666.34478759
Root Mean Squared Error: 68.3106491522
You can see that the value of root mean squared error is 68.31, which is slightly greater than 10% (about 11.8%) of the mean gas consumption across all states (576.77). This means that our algorithm was not very accurate but can still make reasonably good predictions.
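As a cross-check, the three metrics can be recomputed by hand from the Actual and Predicted columns shown earlier. The sketch below uses only the standard library and the values copied from that table, plus the mean consumption from the describe() output. It should reproduce the reported MAE, MSE and RMSE up to rounding.

```python
import math

# Values copied from the Actual / Predicted table above.
actual = [534, 410, 577, 571, 577, 704, 487, 587, 467, 580]
predicted = [469.391989, 545.645464, 589.668394, 569.730413, 649.774809,
             646.631164, 511.608148, 672.475177, 502.074782, 501.270734]

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e * e for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

mean_consumption = 576.770833  # mean from the describe() output above
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("RMSE as a share of mean consumption:", rmse / mean_consumption)
```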
There are many factors that may have contributed to this inaccuracy, a few of which are listed here:
1. Need more data: Only one year worth of data isn't that much, whereas having multiple years worth could have helped us improve the accuracy quite a bit.
2. Bad assumptions: We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
3. Poor features: The features we used may not have had a high enough correlation to the values we were trying to predict.
Note: the data set used in this example is available from here.
http://people.sc.fsu.edu/~jburkardt/datasets/regression/x16.txt
Finally, see the two links below for more info on this topic.
https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html
I have trained and stored a random forest binary classification model. Now I'm trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is:
import h2o
import pandas as pd
import sys
localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()
model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)
new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))
predict = model.predict(new_data) # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2] # probability the prediction is a "1"
print('prediction: ', predicted, ', probability: ', probability)
When I run this code I get:
>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ------------------------------
H2O cluster uptime: 22 hours 22 mins
H2O cluster version: 3.10.5.4
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_Charles_0fqq0c
H2O cluster total nodes: 1
H2O cluster free memory: 6.790 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.1 final
-------------------------- ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
BoxRatio Thrust Velocity OnBalRun vwapGain
---------- -------- ---------- ---------- ----------
1.502 55.044 0.38 37 0.845
[1 row x 5 columns]
>>> predict = model.predict(new_data) # predict returns a data frame
drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3
predict p0 p1
------- --------- ------------------ -------------------
type enum real real
mins 0.8849431818181818 0.11505681818181818
mean 0.8849431818181818 0.11505681818181818
maxs 0.8849431818181818 0.11505681818181818
sigma 0.0 0.0
zeros 0 0
missing 0 0 0
0 1 0.8849431818181818 0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2] # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction: 1 , probability: 0.11505681818181818
>>>
I am confused by the contents of the "predict" data frame. Please tell me what the numbers in the columns labeled "p0" and "p1" mean. I hope they are probabilities, and as you can see by my code, I am trying to get the predicted classification (0 or 1) and a probability that this classification is correct. Does my code correctly do that?
Any comments will be greatly appreciated.
Charles
p0 is the probability (between 0 and 1) that class 0 is chosen.
p1 is the probability (between 0 and 1) that class 1 is chosen.
The thing to keep in mind is that the "prediction" is made by applying a threshold to p1. That threshold point is chosen depending on whether you want to reduce false positives or false negatives. It's not just 0.5.
The threshold chosen for "the prediction" is max-F1. But you can extract out p1 yourself and threshold it any way you like.
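A small sketch of thresholding p1 yourself, using the p0 and p1 values from the output in the question, is shown below. The helper names and the threshold of 0.1 are hypothetical; the model's actual max-F1 threshold is not shown in the question.

```python
def classify(p1, threshold):
    """Hypothetical helper: predict class 1 when p1 reaches the threshold."""
    return 1 if p1 >= threshold else 0


def probability_of_predicted_class(predicted, p0, p1):
    """Return the probability that belongs to whichever class was predicted."""
    return p1 if predicted == 1 else p0


# Values copied from the predict frame in the question.
p0, p1 = 0.8849431818181818, 0.11505681818181818

print(classify(p1, 0.5))  # 0: with a 0.5 cut-off this row would be class 0
print(classify(p1, 0.1))  # 1: with a lower, made-up cut-off it becomes class 1
print(probability_of_predicted_class(1, p0, p1))  # 0.11505681818181818
```

This is consistent with the output in the question: a prediction of 1 with p1 of about 0.115 only means that the threshold in use was below 0.115, not that class 1 is the more probable class.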
Darren Cook asked me to post the first few lines of my training data. Here it is:
BoxRatio Thrust Velocity OnBalRun vwapGain Altitude
0 0.000 0.000 2.186 4.534 0.361 1
1 0.000 0.000 0.561 2.642 0.909 1
2 2.824 2.824 2.199 4.748 1.422 1
3 0.442 0.452 1.702 3.695 1.186 0
4 0.084 0.088 0.612 1.699 0.700 1
The response column is labeled "Altitude". Class 1 is what I want to see from new "out-of-sample" data. "1" is good, and it means that "Altitude" was reached (true positive). "0" means that "Altitude" was not reached (true negative). In the predict table above, "1" was predicted with a probability of 0.11505681818181818. This does not make sense to me.
Charles
I was reading through the decision tree classification section of the page below.
http://spark.apache.org/docs/latest/mllib-decision-tree.html
I built the provided example code on my laptop and tried to understand its output,
but I couldn't quite understand it. The code is below, and
sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
Please look at the output and let me know whether my understanding is correct. Here are my interpretations:
Does the test error mean the model is approximately 95% correct based on the training data?
(The one I am most curious about) If feature 434 is greater than 0.0, is the prediction 1 based on Gini impurity? For example, if the value is given as 434:178, would it be 1?
from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
    # Load the LIBSVM file and hold out 30% of it for testing.
    data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
    (trainingData, testData) = data.randomSplit([0.7, 0.3])
    model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    # Index the (label, prediction) pair; tuple-unpacking lambdas only work in Python 2.
    testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())
    print('Test Error = ' + str(testErr))
    print('Learned classification tree model:')
    print(model.toDebugString())
// =====Below is my output=====
Test Error = 0.0454545454545
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
If (feature 434 <= 0.0)
Predict: 0.0
Else (feature 434 > 0.0)
Predict: 1.0
I believe you are correct. Yes, your error rate is about 5%, so your algorithm is correct about 95% of the time on the 30% of the data you withheld for testing (not on the training data). According to your output (which I will assume is correct; I did not test the code myself), the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0 the prediction is 0, otherwise it is 1.
Why, in Spark ML, are minInfoGain and the minimum number of instances per node not used to control the growth of the tree when training a decision tree model? It is very easy to overgrow the tree.