My platform is Spark 2.1.0, using Python.
I have about 100 random forest multiclass classification models, which I have saved in HDFS. There are 100 datasets saved in HDFS as well.
I want to predict each dataset using the corresponding model. If the models and datasets were cached in memory, prediction would be more than 10 times faster.
But I do not know how to cache the models, because a model is neither an RDD nor a DataFrame.
Thanks!
TL;DR: Just cache the data if it is ever reused outside the prediction process; if not, you can even skip that.
RandomForestModel is a local object not backed by distributed data structures; there is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could, the operation would be meaningless.
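A minimal sketch of that point (Scala here; the HDFS paths and the libsvm data format are placeholder assumptions, sc is the SparkContext, and the same idea applies in PySpark):

import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// The loaded model is a plain local object on the driver; there is nothing to cache.
val model = RandomForestModel.load(sc, "hdfs:///models/rf_001")

// The dataset is an RDD; cache it only if it is reused after prediction.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///datasets/ds_001")
data.cache()

// Prediction itself is a simple map over the data.
val predictions = model.predict(data.map(_.features))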
See also (Why) do we need to call cache or persist on a RDD
Related
I am using the Spark Word2Vec API to build word vectors. The code:
import org.apache.spark.ml.feature.Word2Vec

val w2v = new Word2Vec()
  .setInputCol("words")      // column of tokenized text (Seq[String])
  .setOutputCol("features")
  .setMinCount(5)
But this process is very slow. When I checked the Spark monitoring web UI, there were two jobs that ran for a long time.
My machine has a 24-core CPU and 100 GB of memory; how can I use them efficiently?
I would try increasing the number of partitions in the DataFrame on which you are doing the feature extraction. The stragglers are likely due to skew in the data, causing most of the data to be processed by one node or core. If possible, distribute the data by logical partitioning; if not, create a random, even distribution.
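A sketch of that suggestion, assuming the tokenized text is in a DataFrame df with a "words" column; the partition count of 96 is an arbitrary illustration for a 24-core machine:

import org.apache.spark.ml.feature.Word2Vec

// Spread the input evenly across cores before feature extraction.
val repartitioned = df.repartition(96)

val model = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinCount(5)
  .setNumPartitions(96)   // more Word2Vec partitions speed up training but can cost some accuracy
  .fit(repartitioned)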
I have more than 100 GB of text data stored on S3 in multiple Parquet files. I need to train a Word2Vec model on this. I tried using Spark, but it runs into memory errors for more than 10 GB of data.
My next option is to train using TensorFlow on EMR, but I am unable to decide on the right training strategy for such a case: one big node or multiple small nodes, and what should the size of each node be? How does TensorFlow manage distributed data? Is batch training an option?
When I run random forest in MLlib in Spark client mode, I noticed that even with the same random seed, the results are different every time. I guess the root cause is that when Spark reads the data from HDFS using sc.textFile, the distribution of the data across different executors is random.
Therefore, even after I fix the seed for the random forest, the results differ because the data itself is shuffled differently every time. Is that correct? Is it possible to get the same result with the same seed? Thank you!
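For reference, a sketch of the kind of call being described, with the seed fixed explicitly (the data path and hyperparameters are placeholders); note that how the input is partitioned across executors is not controlled by the seed:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training-data")

// The seed is fixed here, but the partitioning of `data` is not part of the seed.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32,
  seed = 42)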
Can the machine learning algorithms provided by Spark MLlib, like Naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change the code? Could you provide an example of running them in parallel? I am not sure how parallelism (map) works in MLlib, since each step seems to require the entire training dataset. Does the computation run in parallel on subsets of the training data?
Thanks
These algorithms, as provided by Spark MLlib, do run in parallel automatically. They expect an RDD as input; an RDD is a resilient distributed dataset, spread across a cluster of computers.
Here is an example of using a Decision Tree for classification.
I highly recommend exploring the link provided above in depth. The page has extensive documentation and examples of how to code these algorithms, including generating training and testing datasets, scoring, cross-validation, etc.
These algorithms run in parallel by running computations on each worker node's subset of the data and then sharing the results of those computations across worker nodes and with the master node. The master node collects the results of the individual computations and aggregates them as necessary to make decisions based on the entire dataset. Computation-heavy activities are mostly executed on the worker nodes.
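To make that concrete, here is a minimal sketch of training an MLlib Decision Tree on an RDD; the parallelism comes entirely from the RDD's partitions and requires no code changes (the data path and hyperparameters are illustrative):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// The RDD is split into partitions that the worker nodes process in parallel.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training-data")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

val model = DecisionTree.trainClassifier(
  training,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)

// Scoring is a map over the distributed test partitions, aggregated on the driver.
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()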
I'm trying to train an MLlib random forest regression model using the RandomForest.trainRegressor API.
After training, when I try to save the model, the resulting model folder is only 6.5 MB on disk, but there are 1120 small Parquet files in the data folder that seem unnecessary and are slow to upload/download to S3.
Is this the expected behavior? I am repartitioning the labeledPoints to have 1 partition, but this happens regardless.
Repartitioning with rdd.repartition(1) before training does not help much. It potentially makes training slower, because all parallel operations become effectively sequential: the whole parallelism machinery is based on partitions.
Instead, I've come up with a simple hack and set spark.default.parallelism to 1, since the save procedure uses the sc.parallelize method to create the stream it saves.
Keep in mind that this will affect countless places in your app, such as groupBy and join. My suggestion is to extract the train-and-save step of the model into a separate application and run it in isolation.
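A sketch of that workaround in an isolated train-and-save application, with spark.default.parallelism forced to 1 (the app name, paths, and hyperparameters are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Per the workaround above, parallelism of 1 makes the save step write far fewer files,
// which is why this configuration should live in its own small application.
val conf = new SparkConf()
  .setAppName("train-and-save-rf")
  .set("spark.default.parallelism", "1")
val sc = new SparkContext(conf)

val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training-data")
val model = RandomForest.trainRegressor(
  data,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 50,
  featureSubsetStrategy = "auto",
  impurity = "variance",
  maxDepth = 5,
  maxBins = 32)

model.save(sc, "hdfs:///models/rf-regressor")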