(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)
I'd like to run a bootstrapping analysis, with each bootstrap sample running on a dataset that's too large to fit on a single executor.
The naive approach I was going to start with is:
create spark dataframe of training dataset
for i in range(1000):
use df.sample() to create a sample_df
train the model (logistic classifier) on sample_df
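For concreteness, a minimal PySpark sketch of that naive loop might look like the following, assuming a Spark 2.x session named spark; the data path, sampling parameters, and the way results are collected are illustrative assumptions, not a prescription.

    from pyspark.ml.classification import LogisticRegression

    # Hypothetical training data in libsvm format (label/features columns).
    df = spark.read.format("libsvm").load("hdfs:///path/to/training_data")
    df.cache()  # reused by every bootstrap iteration

    models = []
    for i in range(1000):
        # Sample with replacement to get a bootstrap resample of roughly the original size.
        sample_df = df.sample(withReplacement=True, fraction=1.0, seed=i)
        models.append(LogisticRegression(maxIter=100).fit(sample_df))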
Although each individual model is fit across the cluster, this doesn't seem to be very 'parallel' thinking.
Should I be doing this a different way?
I am using the Spark Word2Vec API to build word vectors. The code:
import org.apache.spark.ml.feature.Word2Vec

val w2v = new Word2Vec()
.setInputCol("words")
.setOutputCol("features")
.setMinCount(5)
But this process is very slow. I checked the Spark monitoring web UI, and there were two jobs that ran for a long time.
My machine has a 24-core CPU and 100 GB of memory; how can I use them efficiently?
I would try increasing the number of partitions in the DataFrame that you are doing the feature extraction on. The stragglers are likely due to skew in the data causing most of it to be processed by one node or core. If possible, distribute the data by logical partitioning; if not, create a random, even distribution.
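For example (shown here in PySpark for consistency with the rest of this collection; the Scala DataFrame API is analogous), a repartition before fitting spreads the rows, and therefore the tasks, more evenly. The DataFrame name, partition count, and partitioning column are assumptions to adapt to your data.

    from pyspark.ml.feature import Word2Vec

    # Random, even redistribution; pick the count relative to your total cores.
    df_even = df.repartition(48)
    # Or, if a logical key exists: df_even = df.repartition(48, "doc_id")

    w2v = Word2Vec(inputCol="words", outputCol="features", minCount=5)
    model = w2v.fit(df_even)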
My platform is Spark 2.1.0, using the Python language.
I have about 100 random forest multiclass classification models, which I have saved in HDFS. There are 100 datasets saved in HDFS too.
I want to predict each dataset using the corresponding model. If the models and datasets were cached in memory, prediction would be more than 10 times faster.
But I do not know how to cache the models, because a model is not an RDD or a DataFrame.
Thanks!
TL;DR Just cache the data if it is ever reused outside the prediction process; if not, you can even skip that.
RandomForestModel is a local object not backed by distributed data structures; there is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could be, the operation would be meaningless.
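For illustration, a minimal PySpark 2.x sketch of that advice, assuming the models were saved with the spark.ml API and using hypothetical HDFS paths:

    from pyspark.ml.classification import RandomForestClassificationModel

    # Loading a saved model gives a local, driver-side object; there is nothing to cache.
    model = RandomForestClassificationModel.load("hdfs:///models/model_042")

    df = spark.read.parquet("hdfs:///datasets/dataset_042")
    df.cache()  # only worthwhile if df is reused beyond this one prediction pass

    predictions = model.transform(df)  # a simple, map-only job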
See also (Why) do we need to call cache or persist on a RDD
I have more than 100 GB of text data stored on S3 in multiple Parquet files. I need to train a Word2Vec model on this. I tried using Spark, but it runs into memory errors for more than 10 GB of data.
My next option is to train using TensorFlow on EMR, but I am unable to decide on the right training strategy for such a case. One big node or multiple small nodes, and what should the size of that node be? How does TensorFlow manage distributed data? Is batch training an option?
I'm fitting a large number of models in PySpark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My dataset is a Spark DataFrame of approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with executor memory = 10 GB. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As with any Spark job, there are a few things you can do to improve your training time (a configuration sketch follows these notes).
spark.driver.memory: watch your driver memory. Some algorithms do ship data back to the driver (in order to reduce computing time), so it can be a source of improvement, or at least a point of failure to keep an eye on.
spark.executor.memory: set it to the maximum the job needs, but also as little as possible beyond that, so you can fit more executors on each node (machine) in the cluster; with more workers you'll have more computing power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate data, try different values for this parameter so that you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor spends handling the "shuffle" of tasks inside it.
cache/persist: try to persist your data before heavy transformations. If you're afraid your executors won't be able to hold it all, use StorageLevel.MEMORY_AND_DISK so you can use both.
Important: all of this is based solely on my own experience training algorithms with Spark ML on datasets of 1 TB-5 TB with 30-50 features. I've researched how to improve my own jobs, but I'm not qualified to be a source of truth for your problem. Learn more about your data and watch your executors' logs for further improvements.
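To make the knobs above concrete, here is a minimal sketch of how they might be set when building a PySpark session; the specific values are illustrative assumptions, not recommendations, and should be tuned for your own cluster and workload.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    # Illustrative values only; tune them for your cluster and workload.
    # Note: spark.driver.memory normally has to be set at launch time
    # (e.g. spark-submit --driver-memory 8g), not from inside a running driver.
    spark = (
        SparkSession.builder
        .appName("lr-training")
        .config("spark.executor.memory", "10g")          # as much as the job needs, no more
        .config("spark.executor.cores", "4")             # keeping it below 5, per the note above
        .config("spark.sql.shuffle.partitions", "400")   # experiment with this value
        .getOrCreate()
    )

    # Hypothetical path; persist before heavy transformations if the data is reused.
    df = spark.read.format("libsvm").load("hdfs:///path/to/training_data")
    df.persist(StorageLevel.MEMORY_AND_DISK)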
Can the machine learning algorithms provided by Spark MLlib, like Naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change the code? Kindly provide an example of running in parallel. I am not sure how parallelism (map) works in MLlib, since each step seems to require the entire training dataset. Does the computation run in parallel on subsets of the training data?
Thanks
These algorithms, as provided by Spark MLlib, do run in parallel automatically. They expect an RDD as input. An RDD is a resilient distributed dataset, spread across a cluster of computers.
Here is an example of using a decision tree for a classification problem.
I highly recommend exploring in depth the link provided above. The page has extensive documentation and examples of how to code these algorithms, including generating training and testing datasets, scoring, cross validation, etc.
These algorithms run in parallel by performing computations on each worker node's subset of the data and then sharing the results of those computations across worker nodes and with the master node. The master node collects the results of the individual computations and aggregates them as needed to make decisions based on the entire dataset. Computation-heavy work is mostly executed on the worker nodes.
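As an illustration of the RDD-based API mentioned above, here is a minimal PySpark sketch of training a decision tree classifier on a libsvm file; the path and parameter values are placeholder assumptions.

    from pyspark import SparkContext
    from pyspark.mllib.tree import DecisionTree
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="decision-tree-example")

    # Load a labeled dataset as an RDD of LabeledPoint; the path is a placeholder.
    data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/sample_libsvm_data.txt")
    train, test = data.randomSplit([0.7, 0.3])

    # The RDD is partitioned across the cluster, so training runs on the workers in parallel.
    model = DecisionTree.trainClassifier(
        train,
        numClasses=2,
        categoricalFeaturesInfo={},
        impurity="gini",
        maxDepth=5,
    )

    # Score the held-out data, also in parallel across partitions.
    predictions = model.predict(test.map(lambda lp: lp.features))
    labels_and_preds = test.map(lambda lp: lp.label).zip(predictions)
    accuracy = labels_and_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(test.count())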