I am trying to optimize an ML model using SparkTrials from the hyperopt library. I am running this on a single machine with 16 cores, but when I run the following code, which sets the number of cores to 8, I get a warning that seems to indicate that only one core is used.
SparkTrials accepts a spark_session argument, which in theory is where I set the number of cores.
Can anyone help me?
Thanks!
import os, shutil, tempfile
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
import numpy as np
from sklearn import linear_model, datasets, model_selection
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").config('spark.local.dir', './').config("spark.executor.cores", 8).getOrCreate()
def gen_data(bytes):
    """
    Generates train/test data with target total bytes for a random regression problem.
    Returns (X_train, X_test, y_train, y_test).
    """
    n_features = 100
    n_samples = int(1.0 * bytes / (n_features + 1) / 8)
    X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features, random_state=0)
    return model_selection.train_test_split(X, y, test_size=0.2, random_state=1)

def train_and_eval(data, alpha):
    """
    Trains a LASSO model using training data with the input alpha and evaluates it using test data.
    """
    X_train, X_test, y_train, y_test = data
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    loss = model.score(X_test, y_test)
    return {"loss": loss, "status": STATUS_OK}

def tune_alpha(objective):
    """
    Uses Hyperopt's SparkTrials to tune the input objective, which takes alpha as input and returns loss.
    Returns the best alpha found.
    """
    best = fmin(
        fn=objective,
        space=hp.uniform("alpha", 0.0, 10.0),
        algo=tpe.suggest,
        max_evals=8,
        trials=SparkTrials(parallelism=8, spark_session=spark))
    return best["alpha"]

data_small = gen_data(10 * 1024 * 1024)  # ~10MB

def objective_small(alpha):
    # For small data, you might reference it directly.
    return train_and_eval(data_small, alpha)

tune_alpha(objective_small)
Parallelism (8) is greater than the current total of Spark task slots
(1). If dynamic allocation is enabled, you might see more executors
allocated.
If you are on a cluster: a "core" in Spark nomenclature is unrelated to a physical core of your CPU. With spark.executor.cores you specified the maximum number of threads (= tasks) each executor (you have one here) can run, which is 8. If you want to increase the number of executors, you have to use --num-executors on the command line or the spark.executor.instances configuration property in your code.
I suggest trying something like the following configuration if you are on a YARN cluster:
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.executor.cores", 4)
spark.conf.set("spark.dynamicAllocation.minExecutors", "2")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "10")
Please note that the options above are not available in local mode.
Local: in local mode you have only one executor, and if you want to change the number of its worker threads (one by default) you have to set your master like local[*] or local[16].
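Putting this together for the question's single-machine setup, a minimal sketch (my own illustration, assuming you want up to 8 concurrent trials on the 16-core box) would be:

from pyspark.sql import SparkSession
from hyperopt import SparkTrials

# In local mode the master string, not spark.executor.cores, determines the number of task slots.
spark = (SparkSession.builder
         .master("local[16]")              # 16 worker threads -> 16 task slots
         .config("spark.local.dir", "./")
         .getOrCreate())

trials = SparkTrials(parallelism=8, spark_session=spark)  # 8 trials can now run concurrently

With 16 task slots available, the "Parallelism (8) is greater than the current total of Spark task slots (1)" warning should no longer appear.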
Related
I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
vocab_size = 32_000
input_sentence_size = None
# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")
tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")
# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)
# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still stuck on load_dataset.
What is it doing? Is it normal for load_dataset to take so much time? Should I interrupt it and run it again?
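As a side note on the API (a sketch only, and not necessarily the cause of the slowness): if waiting for the full 46.98 GiB download/generation is the blocker, newer versions of datasets can stream the corpus instead of materializing it in the cache first. A streamed dataset is iterated rather than sliced by index, so the batch iterator has to be adapted, for example:

import datasets

# Sketch: stream OSCAR instead of downloading and generating it up front.
streamed = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train", streaming=True)

def streamed_batch_iterator(batch_length=100):
    batch = []
    for example in streamed:            # streamed datasets are iterable, not indexable
        batch.append(example["text"])
        if len(batch) == batch_length:
            yield batch
            batch = []
    if batch:
        yield batch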
I'm trying to run a simple BoostedTreesClassifier on my dataset from the example, but it seems to get stuck on the first step:
2019-06-28 11:20:31.658689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 84090 of 85873
2019-06-28 11:20:32.908425: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
I0628 11:20:34.904214 140220602029888 basic_session_run_hooks.py:262] loss = 0.6931464, step = 0
W0628 11:21:03.421219 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W0628 11:21:05.555618 140220602029888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
The same dataset seems to work fine when I pass it to other keras based model or xgboost model.
Here's the relevant code:
def make_input_fn(self, X, y, shuffle=True, num_epochs=None):
    num_samples = len(self.y_train)

    def input_fn():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        if shuffle:
            dataset = dataset.shuffle(num_samples).repeat(num_epochs).batch(self.batch_size)
        else:
            dataset = dataset.repeat(num_epochs).batch(self.batch_size)
        return dataset

    return input_fn

def ens_train(self):
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.DEBUG)
    train_input_fn = self.make_input_fn(self.X_train, self.y_train, num_epochs=self.epochs)
    self.model = tf.estimator.BoostedTreesClassifier(
        self.feature_columns,
        n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
        model_dir=self.ofolder,
        max_depth=10,
        n_trees=1000)
    self.model.train(train_input_fn, max_steps=1000)
I was able to get a result by playing around with learning rates and the number of epochs. The "best" parameters obtained by hyperparameter tuning on xgboost do not give similar results with BoostedTreesClassifier. It took a large number of epochs to get to around 84% accuracy (balanced dataset); xgboost had given 95% without even doing hyperparameter tuning.
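For reference, a hedged sketch of where the learning-rate knob lives in the constructor used above, reusing the names from the snippet (the value shown is illustrative only, not the setting that produced the 84% result):

# Illustrative values; tf.estimator.BoostedTreesClassifier also exposes learning_rate (default 0.1).
self.model = tf.estimator.BoostedTreesClassifier(
    self.feature_columns,
    n_batches_per_layer=int(0.5 * len(self.y_train) / self.batch_size),
    model_dir=self.ofolder,
    max_depth=10,
    n_trees=1000,
    learning_rate=0.3)   # raising this can speed up early progress on the loss
self.model.train(train_input_fn, max_steps=1000)

The number of passes over the data is controlled by num_epochs in make_input_fn, as in the snippet above.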
I'm computing learning curves from random forests using sklearn. I need to do it for a lot of different RFs, therefore I want to use a cluster and Dask to reduce the time of the RF fits.
Currently I have implemented the following:
from sklearn.externals import joblib  # deprecated in newer scikit-learn; use "import joblib" instead
from dask.distributed import Client, LocalCluster
from sklearn.ensemble import RandomForestRegressor               # imports implied by the snippet
from sklearn.model_selection import ShuffleSplit, learning_curve

worker_kwargs = dict(memory_limit="2GB", ncores=4)
cluster = LocalCluster(n_workers=4, threads_per_worker=2, **worker_kwargs)  # processes=False?
client = Client(cluster)

X, Y = ..., ...
estimator = RandomForestRegressor(n_jobs=-1, **rf_params)
cv = ShuffleSplit(n_splits=5, test_size=0.2)
train_sizes = [...]  # 20 different values
with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)
There are two levels of parallelism here:
One for the fitting of an RF (n_jobs=-1 on the estimator)
One for the loop over all the training set sizes (n_jobs=-1 in learning_curve)
My problem is: if the backend is loky, then it takes around 23s.
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 22.8s finished
Now, if the backend is dask, then it takes more time:
[Parallel(n_jobs=-1)]: Done 50 out of 50 | elapsed: 30.3s finished
I know that Dask introduces overhead, but I don't expect that to explain all of the difference in running time.
Dask is being developed quickly, and I find a lot of different ways to do the same thing without knowing which one is up to date.
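One diagnostic worth trying (a sketch reusing the names from the snippet above, not a definitive fix): disable the inner level of parallelism so that only the learning_curve loop is distributed by the Dask backend. If the gap shrinks, the two nested n_jobs=-1 settings were oversubscribing the workers.

# Diagnostic sketch: keep the RF fits serial and let Dask parallelize the outer loop only.
estimator = RandomForestRegressor(n_jobs=1, **rf_params)
with joblib.parallel_backend('dask', scatter=[X, Y]):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, Y, cv=cv, n_jobs=-1, train_sizes=train_sizes)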
I am running Spark on my local machine (16 GB RAM, 8 CPU cores). I was trying to train a linear regression model on a dataset of size 300 MB. I checked the CPU statistics and the running processes, and it executes just one thread.
The documentation says they have implemented a distributed version of SGD.
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
sc = SparkContext("local", "Linear Reg Simple")
data = sc.textFile("/home/guptap/Dropbox/spark_opt/test.txt")
data.cache()
parsedData = data.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData)
valuesAndPreds = parsedData.map(lambda p: (p.label,model.predict(p.features)))
# Tuple unpacking in a lambda is Python 2 only; index the pair instead so this also runs on Python 3.
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
model.save(sc, "myModelPath")
sameModel = LinearRegressionModel.load(sc, "myModelPath")
I think what you want to do is explicitly state the number of cores to use in the local context. As you can see from the comments here, "local" (which is what you're doing) instantiates a context on one thread, whereas "local[4]" will run with 4 cores. I believe you can also use "local[*]" to run on all cores of your system.
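Applied to the snippet above, the only change would be the master string, for example:

# Use all local cores; "local[8]" would pin it to exactly 8 threads instead.
sc = SparkContext("local[*]", "Linear Reg Simple")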
Prediction with an SVM model created with 5 features and 3000 samples using default parameters is taking unexpectedly long (more than an hour) on 5 features and 100000 samples. Is there a way of accelerating the prediction?
A few issues to consider here:
Have you standardized your input matrix X? SVM is not scale-invariant, so it can be difficult for the algorithm to classify if it is given a large number of raw inputs without proper scaling.
The choice of the parameter C: a higher C allows a more complicated, non-smooth decision boundary, and it takes much more time to fit at that complexity. So decreasing C from its default of 1 to a lower value could accelerate the process.
It is also recommended to choose a proper value of gamma. This can be done via grid-search cross-validation.
Here is the code to do grid-search cross validation. I ignore the test set here for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV  # in newer scikit-learn: from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))  # 'auto' was renamed to 'balanced' in newer scikit-learn
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421
Note: the SVC is still not running very fast. It takes more than 40s to compute 50 possible combinations of params.
%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s
Because the number of features is relatively low, I would start by decreasing the penalty parameter. It controls the penalty for mislabeled samples in the training data, and as your data contains 5 features, I guess it is not exactly linearly separable.
Generally, this parameter (C) allows the classifier to have a larger margin at the expense of accuracy on the training data (see this for more information).
By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.
One reason might be that the parameter gamma is not the same.
By default sklearn.svm.SVC uses the RBF kernel with gamma set to 0.0, in which case 1/n_features is used instead, so gamma differs when the number of features differs.
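To make that point concrete, a minimal sketch of pinning gamma explicitly so it cannot silently change with the data (the value shown simply reproduces the 1/n_features default for a 5-feature problem):

from sklearn.svm import SVC

# Illustrative only: fixing gamma removes its dependence on n_features.
clf = SVC(kernel='rbf', gamma=1.0 / 5)   # 1/n_features for 5 features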
In terms of suggestions, I agree with Jianxun's answer.