spark java.lang.stackoverflow logistic regression fit with large dataset

spark java.lang.stackoverflow logistic regression fit with large dataset - apache-spark

I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal,weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])
I have also created a checkpoint directory,
sc.setCheckpointDir('checkpoint/')
as suggested here:
Spark gives a StackOverflowError when training using ALS
However I get an error and here is a partial trace:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
I would also like to note that the 470 feature columns were iteratively added to spark data frame using withcolumn().

So the mistake I was making is that, when checkpointing the dataframe, I would only do:
mySparkDataFrame.checkpoint(eager=True)
The right was to do:
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
This is based on another question I had asked (and got an answer for) here:
pyspark rdd isCheckPointed() is false
Also, it is recommended to persist() the dataframe before checkpoint and also to count() it after the checkpoint

Related

RuntimeError: can't start new thread

My objective was to use Scikit-Optimize library in python to minimize the function value in order to find the optimized parameters for xgboost model. The process involve running the model with different random parameters for 5,000 times.
However, it seems that the loop stopped at some point and gave me a RuntimeError: can't start new thread. I am using ubuntu 20.04 and is running python 3.8.5, Scikit-Optimize version is 0.8.1. I ran the same code in windows 10 and it appears that I do not encounter this RuntimeError, however, the code runs much more slower.
I think I may need a threadpool to solve this issue but after searching through the web and I had no luck on finding a solution to implement the threadpool.
Below is a simplified version of the codes:
#This function will be passed to Scikit-Optimize to find the optimized parameters (Params)
def find_best_xgboost_para(params):`
#Defines the parameters that I want to optimize
learning_rate,gamma,max_depth,min_child_weight,reg_alpha,reg_lambda,subsample,max_bin,num_parallel_tree,colsamp_lev,colsamp_tree,StopSteps\
=float(params[0]),float(params[1]),int(params[2]),int(params[3]),\
int(params[4]),int(params[5]),float(params[6]),int(params[7]),int(params[8]),float(params[9]),float(params[10]),int(params[11])
xgbc=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=colsamp_lev,
colsample_bytree=colsamp_tree, gamma=gamma, learning_rate=learning_rate, max_delta_step=0,
max_depth=max_depth, min_child_weight=min_child_weight, missing=None, n_estimators=nTrees,
objective='binary:logistic',random_state=101, reg_alpha=reg_alpha,
reg_lambda=reg_lambda, scale_pos_weight=1,seed=101,
subsample=subsample,importance_type='gain',gpu_id=GPUID,max_bin=max_bin,
tree_method='gpu_hist',num_parallel_tree=num_parallel_tree,predictor='gpu_predictor',verbosity=0,\
refresh_leaf=0,grow_policy='depthwise',process_type=TreeUpdateStatus,single_precision_histogram=SinglePrecision)
tscv = TimeSeriesSplit(CV_nSplit)
error_data=xgboost.cv(xgbc.get_xgb_params(), CVTrain, num_boost_round=CVBoostRound, nfold=None, stratified=False, folds=tscv, metrics=(), \
obj=None, feval=f1_eval, maximize=False, early_stopping_rounds=StopSteps, fpreproc=None, as_pandas=True, \
verbose_eval=True, show_stdv=True, seed=101, shuffle=shuffle_trig)
eval_set = [(X_train, y_train), (X_test, y_test)]
xgbc.fit(X_train, y_train, eval_metric=f1_eval, early_stopping_rounds=StopSteps, eval_set=eval_set,verbose=True)
xgbc_predictions=xgbc.predict(X_test)
error=(1-metrics.f1_score(y_test, xgbc_predictions,average='macro'))
del xgbc
return error
#Define the range of values that Scikit-Optimize can choose from to find the optimized parameters
lr_low, lr_high=float(XgParamDict['lr_low']), float(XgParamDict['lr_high'])
gama_low, gama_high=float(XgParamDict['gama_low']), float(XgParamDict['gama_high'])
depth_low, depth_high=int(XgParamDict['depth_low']), int(XgParamDict['depth_high'])
child_weight_low, child_weight_high=int(XgParamDict['child_weight_low']), int(XgParamDict['child_weight_high'])
alpha_low,alpha_high=int(XgParamDict['alpha_low']),int(XgParamDict['alpha_high'])
lambda_low,lambda_high=int(XgParamDict['lambda_low']),int(XgParamDict['lambda_high'])
subsamp_low,subsamp_high=float(XgParamDict['subsamp_low']),float(XgParamDict['subsamp_high'])
max_bin_low,max_bin_high=int(XgParamDict['max_bin_low']),int(XgParamDict['max_bin_high'])
num_parallel_tree_low,num_parallel_tree_high=int(XgParamDict['num_parallel_tree_low']),int(XgParamDict['num_parallel_tree_high'])
colsamp_lev_low,colsamp_lev_high=float(XgParamDict['colsamp_lev_low']),float(XgParamDict['colsamp_lev_high'])
colsamp_tree_low,colsamp_tree_high=float(XgParamDict['colsamp_tree_low']),float(XgParamDict['colsamp_tree_high'])
StopSteps_low,StopSteps_high=float(XgParamDict['StopSteps_low']),float(XgParamDict['StopSteps_high'])
#Pass the target function (find_best_xgboost_para) as well as parameter ranges to Scikit-Optimize, 'res' will be an array of values that will need to be pass to another function
res=gbrt_minimize(find_best_xgboost_para,[(lr_low,lr_high),(gama_low, gama_high),(depth_low,depth_high),(child_weight_low,child_weight_high),\
(alpha_low,alpha_high),(lambda_low,lambda_high),(subsamp_low,subsamp_high),(max_bin_low,max_bin_high),\
(num_parallel_tree_low,num_parallel_tree_high),(colsamp_lev_low,colsamp_lev_high),(colsamp_tree_low,colsamp_tree_high),\
(StopSteps_low,StopSteps_high)],random_state=101,n_calls=5000,n_random_starts=1500,verbose=True,n_jobs=-1)
Below is the error message:
Traceback (most recent call last):
File "/home/FactorOpt.py", line 91, in <module>Opt(**FactorOptDict)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/gbrt.py", line 179, in gbrt_minimize return base_minimize(func, dimensions, base_estimator,
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/base.py", line 301, in base_minimize
next_y = func(next_x)
File "/home/anaconda3/lib/python3.8/modelling/FactorOpt.py", line 456, in xgboost_opt
res=gbrt_minimize(find_best_xgboost_para,[(lr_low,lr_high),(gama_low, gama_high),(depth_low,depth_high),(child_weight_low,child_weight_high),\
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/gbrt.py", line 179, in gbrt_minimize
return base_minimize(func, dimensions, base_estimator,
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/base.py", line 302, in base_minimize
result = optimizer.tell(next_x, next_y)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/optimizer.py", line 493, in tell
return self._tell(x, y, fit=fit)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/optimizer.py", line 536, in _tell
est.fit(self.space.transform(self.Xi), self.yi)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/learning/gbrt.py", line 85, in fit
self.regressors_ = Parallel(n_jobs=self.n_jobs, backend='threading')(
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1048, in __call__
if self.dispatch_one_batch(iterator):
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
self._dispatch(tasks)
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 784, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 252, in apply_async
return self._get_pool().apply_async(
File "/home/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 407, in _get_pool
self._pool = ThreadPool(self._n_jobs)
File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 925, in __init__
Pool.__init__(self, processes, initializer, initargs)
File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 232, in __init__
self._worker_handler.start()
File "/home/anaconda3/lib/python3.8/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

TensorFlow 2.1 using TPUEstimator: RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor

I just converted an existing project from TF 1.14 to TF 2.1 which uses the TPUEstimator API. After making the conversion, testing locally (i.e. use_tpu=False) runs successfully. However, I am getting errors when running on Google Cloud TPU (i.e. use_tpu=True).
Note: This is in the context of the AdaNet AutoML framework (v0.8.0), although I suspect this may be a general TPUEstimator-related error, as the errors appear to originate in the tpu_estimator.py and error_handling.py scripts seen in the Traceback below:
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3032, in train
rendezvous.record_error('training_loop', sys.exc_info())
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
AttributeError: 'RuntimeError' object has no attribute 'op'
During handling of the above exception, another exception occurred:
File "workspace/trainer/train.py", line 331, in <module>
main(args=parsed_args)
File "workspace/trainer/train.py", line 177, in main
run_config=run_config)
File "workspace/trainer/train.py", line 68, in run_experiment
estimator.train(input_fn=train_input_fn, max_steps=total_train_steps)
File "/usr/local/lib/python3.6/site-packages/adanet/core/estimator.py", line 853, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
config)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3186, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2226, in create_tpu_hostcall
'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:1", shape=(), dtype=int64, device=/job:tpu_worker/task:0/device:CPU:0)'
The previous version of the project using TF 1.14 runs both locally and on TPU using TPUEstimator without issues. Is there something obvious I am potentially missing for the conversion over to TF 2.1 when using TPUEstimator API?

Have you applied the following:
dataset = ...
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
this potentially drops the last few samples from a file to ensure that every batch has a static shape of batch_size, which is required when training on TPUs.

ValueError: negative dimensions are not allowed when loading .pkl file

Although there are many question threads for error ValueError: negative dimensions are not allowed
I couldn't find the answer for my problem
After training Machine learning model using SGDclassifer
clf=linear_model.SGDClassifier(loss='log',random_state=20000,verbose=1,class_weight='balanced')
model=clf.fit(X,Y)
Dimension of X is (1651880,246177)
The below code is working i.e when saving model object and when using model for prediction
joblib.dump(model, 'trainedmodel.pkl',compress=3)
prediction_result=model.predict(x_test)
but getting error when loading the saved model
model = joblib.load('trainedmodel.pkl')
below is the error message
Please help me out to resolve it.
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 598, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 526, in _unpickle
obj = unpickler.load()
File "C:\Users\Taxonomy\Anaconda3\lib\pickle.py", line 1050, in load
dispatch[key[0]](self)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 352, in load_build
self.stack.append(array_wrapper.read(self))
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 195, in read
array = self.read_array(unpickler)
File "C:\Users\Taxonomy\AppData\Roaming\Python\Python36\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 141, in read_array
array = unpickler.np.empty(count, dtype=self.dtype)
ValueError: negative dimensions are not allowed

Try to dump model with protocol 4.
from python's pickle docs:
Protocol version 4 was added in Python 3.4. It adds support for very
large objects, pickling more kinds of objects, and some data format
optimizations. Refer to PEP 3154 for information about improvements
brought by protocol 4.

I am getting error "Restoring from checkpoint failed." while training tensorflow estimator api on AI-platform(ml-engine)

I am trying to do hyperparameter tuning on ai-engine for DNN regressor using tensorflow estimator api. But after submitting the job, it shows that job is failed and I get this error in job details.
Can someone help?
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: learning_rate=0.0019937718716419557, num-layers=2, first-layer-size=148, scale-factor=0.7910729020312588, . The trial's error message was: The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 385, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
tensor_name = dnn/hiddenlayer_0/bias; shape in shape_and_slice spec [148] does not match the shape stored in checkpoint: [117]
[[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1403) ]]

Looks like you are using the same output directory for all the trials, and so trial#1 is trying to read trial#2 checkpoint (perhaps because it is the latest one in the directory) and failing because the architecture is different
Make sure to use a different output directory for each hyperparam training run. There are two ways you do this:
Use the --job-dir as the output directory.
Append a hyperparam trial number to the output directory you are using now:
output_dir = os.path.join( output_dir, json.loads( os.environ.get('TF_CONFIG', '{}') ).get('task', {}).get('trial', '') )

Why does PCA in pyspark run out of memory?

When I run PCA in pyspark, I run out of memory. This is pyspark 1.6.3 and teh execution environment is a zeppelin notebook. Here is an example. Let df be a pyspark DataFrame where 'vectors' is the desired input column (containing a SparseVector of data).
from pyspark.ml.feature import PCA
pca = PCA(k = 100, inputCol="vectors", outputCol = "pca").fit(df)
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2419389767585347468.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/ml/pipeline.py", line 69, in fit
return self._fit(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 133, in _fit
java_model = self._fit_java(dataset)
File "/usr/hdp/current/spark-client/python/pyspark/ml/wrapper.py", line 130, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o222.fit.
: java.lang.OutOfMemoryError: Java heap space
But check this out:
import pandas as pd
import numpy as np
pandf = df.toPandas()
densevectors = [np.array(sparse.toArray()) for sparse in pandf['vectors']]
xtrain = np.vstack(densevectors)
from sklearn.decomposition import PCA as skPCA
skpca = skPCA(n_components=100).fit(xtrain)
skpca.components_.shape
(100, 41277)
Execution time is 14 seconds. There are no memory problems, of course, because the input dataset only has ~9000 rows of sparse vectors. In spark-defaults.conf, driver and executor memory are both set at 12g, and this is an 8 node cluster that should have 32g available per node. There is no way the entire input dataset even takes up 1 MB, not even as a .csv format.
Why is pyspark's PCA implementation running out of memory?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

spark java.lang.stackoverflow logistic regression fit with large dataset - apache-spark

Related

RuntimeError: can't start new thread

TensorFlow 2.1 using TPUEstimator: RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor

ValueError: negative dimensions are not allowed when loading .pkl file

I am getting error "Restoring from checkpoint failed." while training tensorflow estimator api on AI-platform(ml-engine)

Why does PCA in pyspark run out of memory?

Categories

Resources