Spark RandomForest classifier numClasses

I trained a RandomForest like this (Spark 1.6.0):
val numClasses = 4 // 0-2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 9
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 6
val maxBins = 32
val model = RandomForest.trainClassifier(trainRDD, numClasses,
  categoricalFeaturesInfo, numTrees,
  featureSubsetStrategy, impurity,
  maxDepth, maxBins)
Input labels:
labels = labeledRDD.map(lambda lp: lp.label).distinct().collect()
for label in sorted(labels):
    print label
0.0
1.0
2.0
But the output contains only two classes:
metrics = MulticlassMetrics(labelsAndPredictions)
df_confusion = metrics.confusionMatrix()
display_cm(df_confusion)
Output:
83017.0 81.0 0.0
8703.0 2609.0 0.0
10232.0 255.0 0.0
Output from when I load the same model in PySpark and run it against other data (parts of the above):
DenseMatrix([[ 1.75280000e+04, 3.26000000e+02],
[ 3.00000000e+00, 1.27400000e+03]])

It got better... I used Pearson correlation to figure out which columns had no meaningful correlation, deleted the ten lowest-correlating columns, and now I get OK results (a sketch of that screening step follows the metrics below):
Test Error = 0.0401823
precision = 0.959818
Recall = 0.959818
ConfusionMatrix([[ 17323., 0., 359.],
[ 0., 1430., 92.],
[ 208., 170., 1049.]])
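The correlation screening itself is not shown above; here is a minimal PySpark sketch of one way it could be done, assuming labeledRDD is the RDD of LabeledPoints used earlier and that the ten features with the weakest Pearson correlation to the label are the ones being dropped:

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
import numpy as np

# Assumption: labeledRDD is the RDD[LabeledPoint] from the question.
# Build rows of [label, f0, f1, ...] so that row/column 0 of the correlation
# matrix holds each feature's Pearson correlation with the label.
rows = labeledRDD.map(lambda lp: Vectors.dense([lp.label] + lp.features.toArray().tolist()))
corr = Statistics.corr(rows, method="pearson")   # (d+1) x (d+1) numpy array

label_corr = np.abs(corr[0, 1:])                 # |corr(label, feature_i)|
drop = np.argsort(label_corr)[:10]               # indices of the 10 weakest features
print("columns to drop:", sorted(drop.tolist()))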

Related

Why does Spark's evaluator have an avgMetrics attribute if it only returns 1 value?

I'm using MulticlassClassificationEvaluator to retrieve some metrics like F1-Score or accuracy in a Cross Validation in PySpark:
cross_result = CrossValidator(estimator=RandomForestClassifier(),
                              estimatorParamMaps=ParamGridBuilder().build(),
                              evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                              numFolds=5,
                              parallelism=-1)
f1_score = cross_result.avgMetrics[0]
Now, my question is: why is avgMetrics a list if it only has one value? Shouldn't it be a scalar? Am I missing something about this attribute?
Following the source code, I realized that avgMetrics is a list containing, for each parameter combination defined in the ParamGrid, the metric averaged over all the cross-validation folds. So:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
# Note that there are three values for maxIter: 0, 1 and 5
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1, 5]).build()
evaluator = MulticlassClassificationEvaluator(metricName='accuracy')
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=evaluator,
    parallelism=2
)
cvModel = cv.fit(dataset)
cvModel.avgMetrics[0]  # Average accuracy for maxIter = 0
cvModel.avgMetrics[1]  # Average accuracy for maxIter = 1
cvModel.avgMetrics[2]  # Average accuracy for maxIter = 5
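Since avgMetrics lines up with the parameter grid, you can pair the two explicitly; a small sketch using the cvModel and grid defined above:

# avgMetrics[i] is the metric for grid[i], averaged over the folds
for params, metric in zip(grid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)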

How to batch a variable-length spectrogram in TensorFlow

I have to train a denoising autoencoder. I need to batch 5-frame noisy power spectra together with 1-frame clean power spectra, but I don't know how to batch the spectrograms since my data are all of variable length along the time axis.
def parse_line(noise_file, clean_file):
    noise_binary = tf.read_file(noise_file)
    noise_binary = tf.contrib.ffmpeg.decode_audio(noise_binary, file_format='wav', samples_per_second=16000, channel_count=1)
    noise_stfts = tf.contrib.signal.stft(tf.reshape(noise_binary, [1, -1]), frame_length=512, frame_step=256, fft_length=512)
    noise_powerspectrum = tf.log(tf.abs(noise_stfts)**2)
    noise_data = tf.squeeze(tf.contrib.signal.frame(noise_powerspectrum, frame_length=5, frame_step=1, axis=1))
    clean_binary = tf.read_file(clean_file)
    clean_binary = tf.contrib.ffmpeg.decode_audio(clean_binary, file_format='wav', samples_per_second=16000, channel_count=1)
    clean_stfts = tf.contrib.signal.stft(tf.reshape(clean_binary, [1, -1]), frame_length=512, frame_step=256, fft_length=512)
    clean_powerspectrum = tf.log(tf.abs(clean_stfts)**2)
    clean_data = tf.squeeze(clean_powerspectrum)[:-4]
    return noise_data, clean_data
My tf.data pipeline is shown below:
shuffle_batch = 10
batch_size = 10
dataset = tf.data.Dataset.from_tensor_slices((noise_datalist,clean_datalist))
dataset = dataset.shuffle(shuffle_batch) # shuffle number of files perbatch
dataset = dataset.map(parse_line,num_parallel_calls=8)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(tf.contrib.data.AUTOTUNE)
dataset = dataset.make_one_shot_iterator()
next_element = dataset.get_next()
This is the error that shows up:
InvalidArgumentError (see above for traceback): Cannot batch tensors with different shapes in component 0. First element had shape [443,5,257] and element 1 had shape [280,5,257].
[[{{node IteratorGetNext}} = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
When I change batch_size to 1 it works and returns one element. How can I batch this variable-length data, or perhaps concatenate everything along the first axis, e.g. [443,5,257] and [280,5,257] into [723,5,257]?
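One way to get the concatenate-everything layout described at the end of the question ([443,5,257] and [280,5,257] effectively becoming one long stream of [5,257] examples) would be to unbatch each file along the time axis with flat_map before batching. This is only a sketch against the TF 1.x pipeline shown above, not a tested solution:

dataset = tf.data.Dataset.from_tensor_slices((noise_datalist, clean_datalist))
dataset = dataset.shuffle(shuffle_batch)
dataset = dataset.map(parse_line, num_parallel_calls=8)
# parse_line returns noise_data of shape [T, 5, 257] and clean_data of shape [T, 257],
# so slicing both along axis 0 yields T aligned (noisy 5-frame, clean 1-frame) pairs.
dataset = dataset.flat_map(
    lambda noise, clean: tf.data.Dataset.from_tensor_slices((noise, clean)))
dataset = dataset.batch(batch_size)   # every element now has a fixed shape: [5, 257] and [257]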

Tensorflow Extracting Classification Predictions

I have a TensorFlow NN model for classification of one-hot-encoded group labels (the groups are mutually exclusive), which ends with (layerActivs[-1] being the activations of the final layer):
probs = sess.run(tf.nn.softmax(layerActivs[-1]),...)
classes = sess.run(tf.round(probs))
preds = sess.run(tf.argmax(classes))
The tf.round is included to force any low probabilities to 0. If all probabilities are below 50% for an observation, this means that no class will be predicted. I.e., if there are 4 classes, we could have probs[0,:] = [0.2,0,0,0.4], so classes[0,:] = [0,0,0,0]; preds[0] = 0 follows.
Obviously this is ambiguous, as it is the same result that would occur if we had probs[1,:] = [.9, 0, .1, 0] -> classes[1,:] = [1,0,0,0] -> preds[1] = 0. This is a problem when using the TensorFlow built-in metrics class, as the functions can't distinguish between no prediction and a prediction in class 0. This is demonstrated by this code:
import numpy as np
import tensorflow as tf
import pandas as pd
''' prepare '''
classes = 6
n = 100
# simulate data
np.random.seed(42)
simY = np.random.randint(0,classes,n) # pretend actual data
simYhat = np.random.randint(0,classes,n) # pretend pred data
truth = np.sum(simY == simYhat)/n
tabulate = pd.Series(simY).value_counts()
# create placeholders
lab = tf.placeholder(shape=simY.shape, dtype=tf.int32)
prd = tf.placeholder(shape=simY.shape, dtype=tf.int32)
AM_lab = tf.placeholder(shape=simY.shape,dtype=tf.int32)
AM_prd = tf.placeholder(shape=simY.shape,dtype=tf.int32)
# create one-hot encoding objects
simYOH = tf.one_hot(lab,classes)
# create accuracy objects
acc = tf.metrics.accuracy(lab,prd) # real accuracy with tf.metrics
accOHAM = tf.metrics.accuracy(AM_lab,AM_prd) # OHE argmaxed to labels - expected to be correct
# now setup to pretend we ran a model & generated OHE predictions all unclassed
z = np.zeros(shape=(n,classes),dtype=float)
testPred = tf.constant(z)
''' run it all '''
# setup
sess = tf.Session()
sess.run([tf.global_variables_initializer(),tf.local_variables_initializer()])
# real accuracy with tf.metrics
ACC = sess.run(acc,feed_dict = {lab:simY,prd:simYhat})
# OHE argmaxed to labels - expected to be correct, but is it?
l,p = sess.run([simYOH,testPred],feed_dict={lab:simY})
p = np.argmax(p,axis=-1)
ACCOHAM = sess.run(accOHAM,feed_dict={AM_lab:simY,AM_prd:p})
sess.close()
''' print stuff '''
print('Accuracy')
print('-known truth: %0.4f'%truth)
print('-on unprocessed data: %0.4f'%ACC[1])
print('-on faked unclassed labels data (s.b. 0%%): %0.4f'%ACCOHAM[1])
print('----------\nTrue Class Freqs:\n%r'%(tabulate.sort_index()/n))
which has the output:
Accuracy
-known truth: 0.1500
-on unprocessed data: 0.1500
-on faked unclassed labels data (s.b. 0%): 0.1100
----------
True Class Freqs:
0 0.11
1 0.19
2 0.11
3 0.25
4 0.17
5 0.17
dtype: float64
Note that the frequency of class 0 is the same as the faked accuracy...
I experimented with setting a value of preds to np.nan for observations with no predictions, but tf.metrics.accuracy throws ValueError: cannot convert float NaN to integer; also tried np.inf but got OverflowError: cannot convert float infinity to integer.
How can I convert the rounded probabilities to class predictions, but appropriately handle unpredicted observations?
This has gone long enough without an answer, so I'll post my solution here as the answer. I convert class-membership probabilities to class predictions with a new function that has 3 main steps:
1. set any NaN probabilities to 0
2. set any probabilities at or below 1/num_classes to 0
3. use np.argmax() to extract the predicted classes, then assign any unclassed observations to a uniformly selected class
The resultant vector of integer class labels can be passed to the tf.metrics functions. My function is below:
def predFromProb(classProbs):
    '''
    Take in as input an (m x p) matrix of m observations' class probabilities in
    p classes and return an m-length vector of integer class labels (0...p-1).
    Probabilities at or below 1/p are set to 0, as are NaNs; any unclassed
    observations are randomly assigned to a class.
    '''
    numClasses = classProbs.shape[1]
    # zero out class probs that are at or below chance, or NaN
    probs = classProbs.copy()
    probs[np.isnan(probs)] = 0
    probs = probs*(probs > 1/numClasses)
    # find any un-classed observations
    unpred = ~np.any(probs, axis=1)
    # get the predicted classes
    preds = np.argmax(probs, axis=1)
    # randomly classify un-classed observations
    rnds = np.random.randint(0, numClasses, np.sum(unpred))
    preds[unpred] = rnds
    return preds
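Hypothetical usage against the simulated setup above (z is the all-zeros probability matrix from the demo):

# Every row of z is "unclassed", so predFromProb assigns each observation a random
# class label; the resulting integer vector can be fed to tf.metrics.accuracy
# in place of the ambiguous argmax output.
fixed_preds = predFromProb(z)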

Mllib missing values handling

I'm using corr from MLlib with the basic interface, like:
val a:RDD[Double] = sc.makeRDD(Seq(1., 1., 0.))
val b:RDD[Double] = sc.makeRDD(Seq(1., -1., 0.))
val r = Statistics.corr(a, b)
println(r)
Is there a possibility to have casewise or pairwise removal of NaN and Infinity values?
By default MLlib returns NaN as the result of corr when the input contains NaN or infinite values.
To my knowledge, there is no built-in function and you need to filter those values out on your own. One approach is to use java.lang.Double (http://docs.oracle.com/javase/7/docs/api/java/lang/Double.html) functionality:
import java.lang.Double.isNaN
import java.lang.Double.isInfinite
val filtered1 = data1.filter(x => !isNaN(x) && !isInfinite(x))
val filtered2 = data2.filter(x => !isNaN(x) && !isInfinite(x))
val r = Statistics.corr(filtered1, filtered2)
println(r)
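On the casewise/pairwise part of the question: filtering the two RDDs independently can leave them misaligned (and of different lengths), so for pairwise deletion you would normally zip them first and drop whole cases. The question's code is Scala; purely for illustration, here is the same idea sketched in PySpark, assuming a and b are equal-length, identically partitioned RDDs of doubles (as RDD.zip requires):

from math import isnan, isinf
from pyspark.mllib.stat import Statistics

# Drop a case from both series whenever either value is NaN or infinite.
pairs = a.zip(b).filter(lambda xy: not any(isnan(v) or isinf(v) for v in xy))
a_clean = pairs.map(lambda xy: xy[0])
b_clean = pairs.map(lambda xy: xy[1])
print(Statistics.corr(a_clean, b_clean, method="pearson"))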

Linear regression in Apache Spark giving wrong intercept and weights

Using MLlib's LinearRegressionWithSGD on the dummy data set (y, x1, x2) for y = (2*x1) + (3*x2) + 4 produces the wrong intercept and weights. The actual data used is:
x1 x2 y
1 0.1 6.3
2 0.2 8.6
3 0.3 10.9
4 0.6 13.8
5 0.8 16.4
6 1.2 19.6
7 1.6 22.8
8 1.9 25.7
9 2.1 28.3
10 2.4 31.2
11 2.7 34.1
I set the following input parameters and got the below model outputs
[numIterations, step, miniBatchFraction, regParam] [intercept, [weights]]
[5,9,0.6,5] = [2.36667135839938E13, weights:[1.708772545209758E14, 3.849548062850367E13] ]
[2,default,default,default] = [-2495.5635231554793, weights:[-19122.41357929275,-4308.224496146531]]
[5,default,default,default] = [2.875191315671051E8, weights: [2.2013802074495964E9,4.9593017130199933E8]]
[20,default,default,default] = [-8.896967235537095E29, weights: [-6.811932001659158E30,-1.5346020624812824E30]]
Need to know:
How do I get the correct intercept and weights [4, [2, 3]] for the above-mentioned dummy data?
Will tuning the step size help with convergence? I need to run this in an automated manner for several hundred variables, so I am not keen to do that.
Should I scale the data? How will it help?
Below is the code used to generate these results.
object SciBenchTest {

  def main(args: Array[String]): Unit = run

  def run: Unit = {

    val sparkConf = new SparkConf().setAppName("SparkBench")
    val sc = new SparkContext(sparkConf)

    // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
    // i.e. intercept should be 4, weights (2, 3)?
    val data = sc.textFile("data/dummy.csv")

    // LabeledPoint is (label, [features])
    val parsedData = data.map { line =>
      val parts = line.split(',')
      val label = parts(2).toDouble
      val features = Array(parts(0), parts(1)) map (_.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }

    //parsedData.collect().foreach(x => println(x));

    // Scale the features
    /*val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(parsedData.map(x => x.features))
    val scaledData = parsedData
      .map(x =>
        LabeledPoint(x.label,
          scaler.transform(Vectors.dense(x.features.toArray))))
    scaledData.collect().foreach(x => println(x));*/

    // Building the model: SGD = stochastic gradient descent
    val numIterations = 20 //5
    val step = 9.0 //9.0 //0.7
    val miniBatchFraction = 0.6 //0.7 //0.65 //0.7
    val regParam = 5.0 //3.0 //10.0

    //val model = LinearRegressionWithSGD.train(parsedData, numIterations, step) //scaledData

    val algorithm = new LinearRegressionWithSGD() //train(parsedData, numIterations)
    algorithm.setIntercept(true)
    algorithm.optimizer
      //.setMiniBatchFraction(miniBatchFraction)
      .setNumIterations(numIterations)
      //.setStepSize(step)
      //.setGradient(new LeastSquaresGradient())
      //.setUpdater(new SquaredL2Updater()) //L1Updater //SimpleUpdater //SquaredL2Updater
      //.setRegParam(regParam)

    val model = algorithm.run(parsedData)

    println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

    // Evaluate model on training examples
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, point.features, prediction)
    }

    // Print out features, actual and predicted values...
    valuesAndPreds.take(10).foreach({ case (v, f, p) =>
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")
    })
  }
}
As described in the documentation (https://spark.apache.org/docs/1.0.2/mllib-optimization.html), selecting the best step size for SGD methods can often be delicate. I would try with lower values, for example:
// Build linear regression model
var regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.001)
val model = regression.run(parsedData)
Adding the step size did not help us much.
We used the following parameters to calculate the intercept/weights and loss, and used the same to construct a linear regression model in order to predict our features. Thanks #selvinsource for pointing me in the correct direction.
val data = sc.textFile("data/dummy.csv")

// LabeledPoint is (label, [features])
val parsedData = data.map { line =>
  val parts = line.split(',')
  val label = parts(2).toDouble
  val features = Array(parts(0), parts(1)) map (_.toDouble)
  (label, MLUtils.appendBias(Vectors.dense(features)))
}.cache()

val numCorrections = 5 //10//5//3
val convergenceTol = 1e-4 //1e-4
val maxNumIterations = 20 //20//100
val regParam = 0.00001 //0.1//10.0

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  parsedData,
  new LeastSquaresGradient(), //LeastSquaresGradient
  new SquaredL2Updater(), //SquaredL2Updater(),SimpleUpdater(),L1Updater()
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  Vectors.dense(0.0, 0.0, 0.0)) //initialWeightsWithIntercept)

loss.foreach(println)

val model = new LinearRegressionModel(
  Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
  weightsWithIntercept(weightsWithIntercept.size - 1))

println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

// Evaluate model on training examples
val valuesAndPreds = parsedData.collect().map { point =>
  var prediction = model.predict(Vectors.dense(point._2.apply(0), point._2.apply(1)))
  (prediction, point._1)
}

// Print out features, actual and predicted values...
valuesAndPreds.take(10).foreach({ case (v, f) =>
  println(s"Features: ${f}, Predicted: ${v}") //, Actual: ${v}")
})
