Apache Spark in Yarn not using all cores [duplicate] - apache-spark

I am running Spark on my local machine (16 GB, 8 CPU cores). I was trying to train a linear regression model on a dataset of about 300 MB. I checked the CPU statistics and the running processes, and it executes only one thread.
The documentation says a distributed version of SGD is implemented:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext

def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)

model = LinearRegressionWithSGD.train(parsedData)

valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))

model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")

I think what you want to do is explicitly state the number of cores to use in the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread, whereas "local[4]" will run with 4 threads. I believe you can also use "local[*]" to run on all the cores of your system.
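As a minimal, hedged sketch (reusing the parsePoint function and file path from the question), the only change needed is the master string:

from pyspark import SparkContext
from pyspark.mllib.regression import LinearRegressionWithSGD

# "local[*]" creates one worker thread per available core; "local[8]" would pin it to 8
sc = SparkContext("local[*]", "Linear Reg Simple")

parsedData = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt").map(parsePoint).cache()
model = LinearRegressionWithSGD.train(parsedData)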

Related

TensorFlow vs PyTorch: Memory usage

I have PyTorch 1.9.0 and TensorFlow 2.6.0 in the same environment, and both recognize all the GPUs.
I was comparing the performance of the two, so I ran this small, simple test: multiplying large matrices (A and B, both 2000x2000) several times (10000x):
import numpy as np
import os
import time

def mul_torch(A, B):
    # PyTorch matrix multiplication
    os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
    import torch
    A, B = torch.Tensor(A.copy()), torch.Tensor(B.copy())
    A = A.cuda()
    B = B.cuda()

    start = time.time()
    for i in range(10000):
        C = torch.matmul(A, B)
    torch.cuda.empty_cache()
    print('PyTorch:', time.time() - start, 's')
    return C

def mul_tf(A, B):
    # TensorFlow matrix multiplication
    import tensorflow as tf
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    with tf.device('GPU:0'):
        A = tf.constant(A.copy())
        B = tf.constant(B.copy())

        start = time.time()
        for i in range(10000):
            C = tf.math.multiply(A, B)
        print('TensorFlow:', time.time() - start, 's')
    return C

if __name__ == '__main__':
    A = np.load('A.npy')
    B = np.load('B.npy')
    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    PT = mul_torch(A, B)
    time.sleep(5)
    TF = mul_tf(A, B)
As a result:
PyTorch: 19.86856198310852 s
TensorFlow: 2.8338065147399902 s
I was not expecting these results; I thought the times would be similar.
Investigating the GPU performance, I noticed that both use the GPU at full capacity, but PyTorch uses a small fraction of the memory TensorFlow uses. That explains the processing time difference, but I cannot explain the difference in memory usage. Is it something intrinsic to the methods, or is it my computer configuration? Regardless of the matrix size (at least for matrices larger than 1000x1000), these plateaus are the same.
Thank you for your help.
It is because you are doing matrix multiplication in PyTorch but element-wise multiplication in TensorFlow. To do matrix multiplication in TF, use tf.matmul or simply:
for i in range(10000):
    C = A @ B
That does the same thing in both TF and torch. For fairness you also have to call torch.cuda.synchronize() inside the time measurement and move torch.cuda.empty_cache() outside of the measurement.
The expected result is that TensorFlow's eager execution is slower than PyTorch.
Regarding the memory usage: TF by default claims all GPU memory, so nvidia-smi on Linux (or, similarly, Task Manager on Windows) does not reflect the actual memory usage of the operations.
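A hedged sketch of what a fairer timed loop might look like (array shapes and the 10000-iteration count follow the question; the synchronization points are the part being recommended):

import time
import numpy as np
import tensorflow as tf
import torch

A_np = np.random.rand(2000, 2000).astype(np.float32)
B_np = np.random.rand(2000, 2000).astype(np.float32)

# PyTorch: matrix multiplication, synchronized before stopping the clock
A_t, B_t = torch.tensor(A_np).cuda(), torch.tensor(B_np).cuda()
start = time.time()
for _ in range(10000):
    C_t = A_t @ B_t
torch.cuda.synchronize()   # wait for queued GPU work before reading the clock
print('PyTorch:', time.time() - start, 's')
torch.cuda.empty_cache()   # cache cleanup stays outside the timed region

# TensorFlow: matrix multiplication (@ / tf.matmul), not tf.math.multiply
with tf.device('GPU:0'):
    A_tf, B_tf = tf.constant(A_np), tf.constant(B_np)
    start = time.time()
    for _ in range(10000):
        C_tf = A_tf @ B_tf
    _ = C_tf.numpy()       # pull the result to host so pending GPU work has finished
    print('TensorFlow:', time.time() - start, 's')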

Setting task slots with pyspark on an individual machine

I am trying to run the optimization of an ML model using SparkTrials from the hyperopt library. I am running this on a single machine with 16 cores, but when I run the following code, which sets the number of cores to 8, I get a warning that seems to indicate that only one core is used.
SparkTrials accepts as an argument spark_session, which in theory is where I set the number of cores.
Can anyone help me?
Thanks!
import os, shutil, tempfile
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
import numpy as np
from sklearn import linear_model, datasets, model_selection
import pyspark
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local")
         .config('spark.local.dir', './')
         .config("spark.executor.cores", 8)
         .getOrCreate())

def gen_data(bytes):
    """
    Generates train/test data with target total bytes for a random regression problem.
    Returns (X_train, X_test, y_train, y_test).
    """
    n_features = 100
    n_samples = int(1.0 * bytes / (n_features + 1) / 8)
    X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
    return model_selection.train_test_split(X, y, test_size=0.2, random_state=1)

def train_and_eval(data, alpha):
    """
    Trains a LASSO model using training data with the input alpha and evaluates it using test data.
    """
    X_train, X_test, y_train, y_test = data
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    loss = model.score(X_test, y_test)
    return {"loss": loss, "status": STATUS_OK}

def tune_alpha(objective):
    """
    Uses Hyperopt's SparkTrials to tune the input objective, which takes alpha as input and returns loss.
    Returns the best alpha found.
    """
    best = fmin(
        fn=objective,
        space=hp.uniform("alpha", 0.0, 10.0),
        algo=tpe.suggest,
        max_evals=8,
        trials=SparkTrials(parallelism=8, spark_session=spark))
    return best["alpha"]

data_small = gen_data(10 * 1024 * 1024)  # ~10MB

def objective_small(alpha):
    # For small data, you might reference it directly.
    return train_and_eval(data_small, alpha)

tune_alpha(objective_small)
Parallelism (8) is greater than the current total of Spark task slots
(1). If dynamic allocation is enabled, you might see more executors
allocated.
If you are on a cluster: a core in Spark nomenclature is unrelated to a physical core in your CPU. Here, with spark.executor.cores you specified that the maximum number of threads (= tasks) each executor (you have one here) can run is 8. If you want to increase the number of executors you have to use --num-executors on the command line or the spark.executor.instances configuration property in your code.
If you are on a YARN cluster, I suggest trying something like this configuration:
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors","2")
spark.conf.set("spark.dynamicAllocation.maxExecutors","10")
Please note that the options above are not available in local mode.
Local: in local mode you have only one executor, and if you want to change the number of its worker threads (which is one by default) you have to set your master like local[*] or local[16].
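A minimal sketch of that change applied to the question's session builder (the rest of the script stays the same; local[16] is just an illustrative thread count matching the 16-core machine):

from pyspark.sql import SparkSession
from hyperopt import SparkTrials

# local[16] gives the single local executor 16 worker threads (one task slot each),
# so SparkTrials(parallelism=8) can actually run 8 trials at once
spark = (SparkSession.builder
         .master("local[16]")
         .config("spark.local.dir", "./")
         .getOrCreate())

trials = SparkTrials(parallelism=8, spark_session=spark)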

Faster K-Means Clustering in TensorFlow

Dear TensorFlow Community,
I'm training a classifier with tf.contrib.factorization.KMeansClustering, but the training is really slow and only uses about 1% of my GPU, while my 4 CPU cores sit at roughly 35% use constantly.
Is K-Means written more for the CPU than the GPU? Is there a way I can shift more of the computation to the GPU, or some other approach to speed up training?
Below is my script for training (Python3).
Thank you for your time.
import tensorflow as tf

def parser(record):
    features = {
        'feats': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_single_example(record, features)
    feats = tf.convert_to_tensor(tf.decode_raw(parsed['feats'], tf.float64))
    return {'feats': feats}

def my_input_fn(tfrecords_path):
    dataset = (
        tf.data.TFRecordDataset(tfrecords_path)
        .map(parser)
        .batch(1024)
    )
    iterator = dataset.make_one_shot_iterator()
    batch_feats = iterator.get_next()
    return batch_feats

### SPEC FUNCTIONS ###
train_spec_kmeans = tf.estimator.TrainSpec(input_fn=lambda: my_input_fn('/home/ubuntu/train.tfrecords'), max_steps=10000)
eval_spec_kmeans = tf.estimator.EvalSpec(input_fn=lambda: my_input_fn('/home/ubuntu/eval.tfrecords'))

### INIT ESTIMATOR ###
KMeansEstimator = tf.contrib.factorization.KMeansClustering(
    num_clusters=500,
    feature_columns=[tf.feature_column.numeric_column(
        key='feats',
        dtype=tf.float64,
        shape=(377,),
    )],
    use_mini_batch=True)

### TRAIN & EVAL ###
tf.estimator.train_and_evaluate(KMeansEstimator, train_spec_kmeans, eval_spec_kmeans)
Best,
Josh
Here's my best answer so far, with timing information, building off of Eliethesaiyan's answer and the linked docs.
My original Dataset code block and its performance:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn)
    .batch(1024)
)
real 1m36.171s
user 2m57.756s
sys 0m42.304s
Eliethesaiyan's answer (prefetch + num_parallel_calls)
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parse_fn, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
    .prefetch(1024)
)
real 0m41.450s
user 1m33.120s
sys 0m18.772s
From the docs using map_and_batch + num_parallel_batches + prefetch:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .apply(
        tf.contrib.data.map_and_batch(
            map_func=parse_fn,
            batch_size=1024,
            num_parallel_batches=multiprocessing.cpu_count()
        )
    )
    .prefetch(1024)
)
real 0m32.855s
user 1m11.412s
sys 0m10.408s
One of the things I saw that increases GPU and CPU usage is using prefetch on the dataset. It keeps the dataset producer fetching data while the model is consuming the previous batch, thereby maximizing resource usage. Also, specifying the maximum number of CPUs for the map speeds up the process.
I would restructure it this way:
dataset = (
    tf.data.TFRecordDataset(tfrecords_path)
    .map(parser, num_parallel_calls=multiprocessing.cpu_count())
    .batch(1024)
)
dataset = dataset.prefetch(1024)
Here is a nice guide of best practices when it comes to using TFRecords: here

not able to restore tensorflow graph using python3 multiprocessing

I want to be able to load an existing TensorFlow network from several processes at once, using the multiprocessing library, to do inference on different cores simultaneously.
import tensorflow as tf

def spawn_process(x):
    g = tf.Graph()
    sess = tf.Session(graph=g)
    with g.as_default():
        meta_graph = tf.train.import_meta_graph('model.meta')
        meta_graph.restore(sess, tf.train.latest_checkpoint(chkp_dir))
        x_ph = tf.get_collection('x')[0]     # placeholder tensor that we use to pass in new x's to do inference on
        pred = tf.get_collection('pred')[0]  # tensor responsible for computing the prediction given x
        prediction = sess.run(pred, feed_dict={x_ph: x})
        return prediction                    # returned so map / Pool.map can collect it
This is basically the function I want to pass to Pool.map to do inference in parallel.
The function above works with plain map (or a list comprehension), and it looks like this:
predictions = list(map(spawn_process, range(10)))
predictions = [spawn_process(x) for x in range(10)]
Both of the above work as expected.
But when I try to do this, it fails: each process just hangs right before the meta_graph.restore line and I'm stumped.
p = multiprocess.Pool(4)
predictions = p.map(spawn_process, range(10))
p.close()
p.join()
I don't know why this isn't working with TensorFlow; this approach normally works for me when I do any sort of computation in parallel. It stops right before the meta_graph.restore line and all the processes hang there.

Spark Decision Tree with Spark

I was reading through the website below for the decision tree classification part.
http://spark.apache.org/docs/latest/mllib-decision-tree.html
I built the provided example code on my laptop and tried to understand its output, but there is a part I couldn't understand. Below is the code, and sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt
Please refer to the output and let me know whether my opinions are correct. Here are my opinions:
Does the Test Error mean it is approximately 95% correct based on the training data?
(the one I'm most curious about) If feature 434 is greater than 0.0, then the prediction is 1, based on Gini impurity? For example, if the value is given as 434:178, it would be 1.
from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
    data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
    (trainingData, testData) = data.randomSplit([0.7, 0.3])

    model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())

    print('Test Error = ' + str(testErr))
    print('Learned classification tree model:')
    print(model.toDebugString())
===== Below is my output =====
Test Error = 0.0454545454545
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 434 <= 0.0)
   Predict: 0.0
  Else (feature 434 > 0.0)
   Predict: 1.0
I believe you are correct. Yes, your error rate is about 5%, so your algorithm is correct about 95% of the time on the 30% of the data you withheld for testing. According to your output (which I will assume is correct; I did not test the code myself), yes, the only feature that determines the class of an observation is feature 434: if it is less than or equal to 0 the prediction is 0, else 1.
Why, in Spark ML, when training a decision tree model, aren't minInfoGain or the minimum number of instances per node used to control the growth of the tree? It is very easy to overgrow the tree.
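For reference, a hedged sketch of how those stopping parameters can be passed in the MLlib API used in the question's code (the values are purely illustrative, not recommendations):

from pyspark.mllib.tree import DecisionTree

# minInstancesPerNode and minInfoGain stop splitting nodes that are too small
# or whose best split gains too little; both default to permissive values.
model = DecisionTree.trainClassifier(
    trainingData,
    numClasses=2,
    categoricalFeaturesInfo={},
    impurity='gini',
    maxDepth=5,
    maxBins=32,
    minInstancesPerNode=10,   # illustrative: require at least 10 training rows per child
    minInfoGain=0.01)         # illustrative: require at least this much impurity reduction per split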
