Train Word2Vec on 100+GB of data - apache-spark

I have more than 100 GB of text data stored on S3 in multiple parquet files, and I need to train a Word2Vec model on it. I tried using Spark, but it runs into memory errors for anything over about 10 GB of data.
My next option is to train with TensorFlow on EMR, but I am unable to decide what the right training strategy would be for such a case. One big node or multiple small nodes, and what size should those nodes be? How does TensorFlow manage distributed data? Is batch training an option?
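For reference, here is a minimal PySpark sketch of the setup described above, reading the parquet files from S3 and fitting Spark ML's Word2Vec; the bucket path, column name, and tokenizer are placeholders I am assuming, not details from the post.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, Word2Vec

spark = SparkSession.builder.appName("word2vec-100gb").getOrCreate()

# Read only the text column from the parquet files on S3 (placeholder path and column name).
text_df = spark.read.parquet("s3://my-bucket/text-data/").select("text")

# Word2Vec expects an array-of-strings column, so tokenize first.
words_df = Tokenizer(inputCol="text", outputCol="words").transform(text_df)

model = Word2Vec(inputCol="words", outputCol="features", minCount=5).fit(words_df)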

Related

How to speed up training Word2vec model with spark?

I am using Spark's Word2Vec API to build word vectors. The code:
import org.apache.spark.ml.feature.Word2Vec

val w2v = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("features")
  .setMinCount(5)
val model = w2v.fit(wordsDF)  // wordsDF: the input DataFrame with an array-of-strings "words" column
But this process is very slow. I checked the Spark monitoring web UI, and there were two jobs that ran for a long time.
My machine has a 24-core CPU and 100 GB of memory; how do I use them efficiently?
I would try increasing the number of partitions in the DataFrame that you are doing the feature extraction on. The stragglers are likely due to skew in the data causing most of it to be processed by one node or core. If possible, distribute the data by logical partitioning; if not, create a random, even distribution.
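A minimal PySpark illustration of that advice; the tiny in-memory input is a placeholder, and the partition count of 24 (one per core on the machine described above) is an example, not a recommendation from the answer. The same repartition/numPartitions options exist in the Scala API used in the question.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()
# Placeholder input; in practice this is the DataFrame the features are extracted from.
words_df = spark.createDataFrame([(["spark", "word2vec", "example"],)] * 5, ["words"])

# Spread the rows evenly across cores before feature extraction, and let
# Word2Vec itself train on the same number of partitions.
evenly_distributed = words_df.repartition(24)
w2v = Word2Vec(inputCol="words", outputCol="features", minCount=5, numPartitions=24)
model = w2v.fit(evenly_distributed)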

How to cache random forest models in Spark

My platform is Spark 2.1.0, using Python.
I have about 100 random forest multiclass classification models, which I have saved to HDFS. There are 100 datasets saved in HDFS too.
I want to run predictions on each dataset with its corresponding model. If the models and datasets were cached in memory, prediction would be more than 10 times faster.
But I do not know how to cache the models, because a model is not an RDD or DataFrame.
Thanks!
TL;DR Just cache the data if it is ever reused outside the prediction process; if it isn't, you can even skip that.
RandomForestModel is a local object not backed by distributed data structures; there is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could, the operation would be meaningless.
See also (Why) do we need to call cache or persist on an RDD
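A minimal PySpark sketch of that pattern: load each (local) model, cache the corresponding dataset only if it is reused, and run the map-only prediction. The HDFS paths and the use of the RDD-based MLUtils loader are assumptions, since the post shows no code.
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForestModel
from pyspark.mllib.util import MLUtils

sc = SparkContext.getOrCreate()
for i in range(100):
    # Caching the data only pays off if it is reused beyond this one prediction pass.
    data = MLUtils.loadLibSVMFile(sc, "hdfs:///datasets/%d" % i).cache()
    # The loaded model is a plain local (driver-side) object; there is nothing to cache.
    model = RandomForestModel.load(sc, "hdfs:///models/%d" % i)
    # Prediction is a simple map-only job over the feature vectors.
    predictions = model.predict(data.map(lambda p: p.features))
    predictions.saveAsTextFile("hdfs:///predictions/%d" % i)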

How to convert Neural network model (.h5) file to spark RDD file in python

I want to store the deep learning model's data in the Spark environment as an RDD file, and to load the model back from that RDD file (i.e. reverse the conversion) in Python. Can you suggest a possible way of doing this?

How best to fit many Spark ML models

(PySpark, either Spark 1.6 or 2.0, shared YARN cluster with dozens of nodes)
I'd like to run a bootstrapping analysis, with each bootstrap sample running on a dataset that's too large to fit on a single executor.
The naive approach I was going to start with is:
# PySpark sketch; 'spark' is the active SparkSession and the parquet path is a placeholder.
from pyspark.ml.classification import LogisticRegression

train_df = spark.read.parquet("path/to/training_data")  # spark dataframe of the training dataset
for i in range(1000):
    sample_df = train_df.sample(withReplacement=True, fraction=1.0, seed=i)
    model = LogisticRegression().fit(sample_df)  # train the logistic classifier on sample_df
Although each individual model is fit across the cluster, this doesn't seem to be very 'parallel' thinking.
Should I be doing this a different way?

MLlib model (RandomForestModel) saves model with numerous small parquet files

I'm trying to train an MLlib random forest regression model using the RandomForest.trainRegressor API.
After training, when I try to save the model, the resulting model folder is 6.5 MB on disk, but there are 1120 small parquet files in the data folder that seem unnecessary and are slow to upload/download to S3.
Is this the expected behavior? I do repartition the labeledPoints to have 1 partition, but this happens regardless.
Repartitioning with rdd.repartition(1) before training does not help much. It can make training slower, because all parallel operations become effectively sequential, since the whole parallelism machinery is based on partitions.
Instead, I came up with a simple hack and set spark.default.parallelism to 1, since the save procedure uses the sc.parallelize method to create the stream it writes out.
Keep in mind that this will affect countless places in your app, such as groupBy and join. My suggestion is to extract the train-and-save step for the model into a separate application and run it in isolation.
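A sketch of that workaround as a standalone PySpark train-and-save job; the paths and the forest parameters are placeholders, not values from the post.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Run this as its own application so the setting does not affect groupBy/join elsewhere.
conf = SparkConf().setAppName("train-and-save-rf").set("spark.default.parallelism", "1")
sc = SparkContext(conf=conf)

data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/labeled_points")  # RDD of LabeledPoint
model = RandomForest.trainRegressor(
    data,
    categoricalFeaturesInfo={},
    numTrees=50,                   # placeholder forest parameters
    featureSubsetStrategy="auto",
    impurity="variance",
    maxDepth=5)

# Per the workaround above: model.save builds its output via sc.parallelize,
# which honors spark.default.parallelism, so far fewer small part files are written.
model.save(sc, "hdfs:///path/to/model")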
