I am using the Spark Word2Vec API to build word vectors. The code:
val w2v = new Word2Vec()
But this process is very slow. In the Spark monitoring web UI, I can see two jobs that run for a long time.
My machine has a 24-core CPU and 100 GB of memory. How can I use them efficiently?

I would try increasing the number of partitions in the DataFrame that you are doing the feature extraction on. The stragglers are likely due to skew in the data, which causes most of the data to be processed by one node or core. If possible, distribute the data by logical partitioning; if not, create a random, even distribution.
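The effect of skew on partition sizes can be illustrated with a small plain-Python simulation. This is a sketch only, not Spark code: the keys, the row counts, and the choice of 4 partitions are made up for illustration, and `zlib.crc32` merely stands in for whatever hash a real partitioner would use.

```python
import random
import zlib
from collections import Counter

def partition_sizes(assignments, num_partitions):
    """Count how many rows land in each partition index."""
    counts = Counter(assignments)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical skewed data: one key accounts for 90% of the rows.
keys = ["news"] * 9000 + ["sports"] * 500 + ["tech"] * 500
num_partitions = 4

# Key-based assignment: every row with the same key goes to the same partition.
by_key = [zlib.crc32(k.encode()) % num_partitions for k in keys]

# Random assignment: rows are spread evenly regardless of their key.
rng = random.Random(42)
by_random = [rng.randrange(num_partitions) for _ in keys]

key_sizes = partition_sizes(by_key, num_partitions)
random_sizes = partition_sizes(by_random, num_partitions)
print("key-based partition sizes:", key_sizes)
print("random partition sizes:   ", random_sizes)
```

With key-based assignment, one partition holds at least the 9000 rows of the dominant key, so one task does most of the work. With random assignment, each partition holds roughly a quarter of the rows.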


Do we need all the data in memory for running group by on Spark

I'm trying to run a group by operation on a huge dataset (around 50 TB), something like this:
df_grouped = df.groupby(df['col1'], df['col2']).sum('col3')
I'm using the DataFrame API in PySpark and running this on EMR with 12 r5.4xlarge machines. The job takes a long time to process and is eventually killed with an OOM error.
My questions are:
Are there any best practices for running a group by operation with Spark?
Do we need all the data to fit in memory when running this?
The groupBy operation is not efficient for such large datasets. The OOM in groupBy indicates that there might be data skew, because the groupBy implementation reads all the data in a partition into memory. You can take a look at the implementation here
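One commonly suggested workaround when a single key dominates is to "salt" the key and aggregate in two stages. The sketch below simulates the idea in plain Python rather than Spark. The data, the number of salts, and the function name `salted_sum` are all hypothetical, and whether the approach helps a given Spark job depends on the aggregation and the data. It relies on the sum being associative, so the two-stage result equals the direct one.

```python
import random
from collections import defaultdict

def salted_sum(rows, num_salts=8, seed=0):
    """Two-stage sum: first per (key, salt), then per key."""
    rng = random.Random(seed)
    # Stage 1: partial sums per (key, salt). A hot key is split into
    # up to num_salts smaller groups instead of one huge group.
    partial = defaultdict(int)
    for key, value in rows:
        partial[(key, rng.randrange(num_salts))] += value
    # Stage 2: drop the salt and combine the partial sums per key.
    total = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        total[key] += subtotal
    return dict(total), dict(partial)

# Hypothetical skewed input: (col1, col2) is the key, col3 is the value.
rows = [(("a", "x"), 1)] * 1000 + [(("b", "y"), 5)] * 10
totals, partial = salted_sum(rows)
print(totals)
```

Each stage-1 group for the hot key `("a", "x")` covers only a fraction of its 1000 rows, while the final totals match a plain group-by sum.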

Avoid chunk / batch processing in Spark

I often encounter a pattern of dividing big processing steps into batches when these steps can't be processed in one go on our Spark cluster.
For instance, we have a large cross join or some computation that fails when run on all the input data, so we usually divide the Spark job into chunks so that the smaller jobs can complete.
Personally, I doubt this is the right way to do it in Spark.
Is there a recipe to solve this issue? Or, even with Spark, are we back to the old way of chunking/batching the work so that it can be completed on a small cluster?
Is this a mere question of re-partitioning the input data so that Spark can do more sequential processing instead of parallel processing?

Tuning model fits in Spark ML

I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a Spark DataFrame of approximately 50 GB, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with 10 GB of executor memory. Fitting a logistic regression classifier creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340 MB each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As with any Spark job, there are a few things you can do to improve your training time.
spark.driver.memory: keep an eye on your driver memory. Some algorithms send data to the driver (in order to reduce computing time), so it can be a source of improvement, or at least a point of failure to watch.
spark.executor.memory: set it to what the job needs, but no higher, so that you can fit more executors on each node (machine) in the cluster. With more workers, you'll have more computing power to handle the job.
spark.sql.shuffle.partitions: since you probably use DataFrames to manipulate data, try different values for this parameter so that you can execute more tasks per executor.
spark.executor.cores: keep it below 5 and you're good; above that, you will probably increase the time an executor has to spend handling the "shuffle" of tasks inside it.
cache/persist: try to persist your data before huge transformations. If you're afraid your executors won't be able to hold it in memory, use StorageLevel.MEMORY_AND_DISK so that both are used.
Important: all of this is based on my own experience alone training algorithms using Spark ML over datasets with 1TB-5TB and 30-50 features, I've researched to improve my own jobs but I'm not qualified as a source of truth for your problem. Learn more about your data and watch the logs of your executors for further enhancements.
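To make the settings above concrete, here is a minimal sketch that collects them in one place and renders them as `--conf` arguments for spark-submit. Apart from the 10g of executor memory mentioned in the question, every value is a hypothetical starting point rather than a recommendation. The helper `to_spark_submit_args` and the script name `your_job.py` are made up for illustration.

```python
# Hypothetical starting values; tune them against your own job and executor logs.
settings = {
    "spark.driver.memory": "8g",            # assumption: headroom for results sent to the driver
    "spark.executor.memory": "10g",         # value taken from the question
    "spark.executor.cores": "4",            # kept below 5, as suggested above
    "spark.sql.shuffle.partitions": "400",  # assumption: try several values
}

def to_spark_submit_args(conf):
    """Render a dict of Spark properties as a flat list of --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

submit_args = to_spark_submit_args(settings)
print("spark-submit " + " ".join(submit_args) + " your_job.py")
```

Keeping the properties in a single dictionary makes it easier to change one value at a time and compare stage runtimes between runs.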

Spark MLLIB parallelism multiple nodes

Can the machine learning algorithms provided by Spark MLlib, such as Naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change the code? Could you provide an example that runs in parallel? I'm not sure how parallelism (map) works in MLlib, since each computation seems to require the entire training data set. Does the computation run in parallel on subsets of the training data?
These algorithms as provided by Spark MLLib do run in parallel automatically. They expect an RDD as input. An RDD is a resilient distributed dataset, spread across a cluster of computers.
Here is an example problem using a Decision Tree for classification problems.
I highly recommend exploring in depth the link provided above. The page has extensive documentation and examples of how to code these algorithms, including generating training and testing datasets, scoring, cross validation, etc.
These algorithms run in parallel by running computations on the worker nodes' subset of the data, and then sharing the results of those computations across worker nodes and with the master node. The master node collects the results of individual computations and aggregates them as necessary to make decisions based on the entire dataset. Computation heavy activities are mostly executed on the worker nodes.
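The pattern described above (partial results per worker, then aggregation) can be sketched in plain Python. This is not MLlib code and is not meant to show how MLlib implements any particular algorithm. It only illustrates how Naive Bayes style counts can be computed per partition and then merged. The toy data and the function name `partial_counts` are hypothetical.

```python
from collections import Counter
from functools import reduce

def partial_counts(partition):
    """Work done on one worker: count (label, feature) pairs in its slice of the data."""
    counts = Counter()
    for label, features in partition:
        for feature in features:
            counts[(label, feature)] += 1
    return counts

# Hypothetical training data, already split into two partitions.
partitions = [
    [("spam", ["buy", "now"]), ("ham", ["hello"])],
    [("spam", ["buy"]), ("ham", ["hello", "now"])],
]

# Each partition is processed independently (in a cluster, in parallel).
partials = [partial_counts(p) for p in partitions]

# The partial counts are merged into statistics for the whole dataset.
merged = reduce(lambda a, b: a + b, partials, Counter())
print(merged)
```

No worker ever needs the entire training set. Only the small count tables travel across the network, and the merged counts are the same as if the data had been counted in one place.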

How to train word2vec model efficiently in the spark cluster environment?

I want to train a Word2Vec model on a news corpus of about 10 GB on my Spark cluster.
The following is the configuration of my Spark cluster:
One master and 4 workers,
each with 80 GB of memory and 24 cores.
However, I find that training Word2Vec with Spark MLlib doesn't take full advantage of the cluster's resources.
For example:
[picture: output of the top command on Ubuntu]
As the picture above shows, only 100% CPU is used on one worker, and the other three workers are not in use (so I have not pasted their pictures). I just trained a Word2Vec model on a news corpus of about 2 GB, and it took about 6 hours. How can I train the model more efficiently? Thanks to everyone in advance :)
UPDATE 1: This is how I start spark-shell:
spark-shell \
--master spark://ip:7077 \
--executor-memory 70G \
--driver-memory 70G \
--conf spark.akka.frameSize=2000 \
--conf spark.driver.maxResultSize=0 \
--conf spark.default.parallelism=180
These are the commands I use to train the Word2Vec model in spark-shell:
//import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
//read about 10G newsdata corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*",600).map(line => line.split(" ").toSeq)
//Configure word2vec parameters
val word2vec = new Word2Vec()
//train the model
val model = word2vec.fit(newsdata) // assumed completion: the original line was cut off after "="
I have been training the model for about 24 hours and it still has not completed. The cluster is running like this:
Only 100% CPU is used on one worker, and the other three workers are not in use, as before.
I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for word2vec here, it reads:
setNumIterations(numIterations): Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions. New in version 1.2.0.
setNumPartitions(numPartitions): Sets number of partitions (default: 1). Use a small number for accuracy. New in version 1.2.0.
My Word2Vec model stopped hanging, and Spark stopped running out of memory, when I increased the number of partitions used by the model so that numIterations <= numPartitions.
I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10).
As your model is taking too long to train, I think you should first try and understand how spark actually benefits the model training part. As per this paper,
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty
Spark mllib's libraries remove this performance penalty by caching the data in memory during the first iteration. So subsequent iterations are extremely quick compared to the first iteration and hence, there is a significant reduction in model training time. I think, in your case, the executor memory might be insufficient to load a partition of data in memory. Hence contents would be spilled to disk and would need to be fetched from disk again in every iteration, thus killing any performance benefits of spark. To make sure, this is actually the case, you should try and look at the executor logs which would contain some lines like "Unable to store rdd_x_y in memory".
If this is indeed the case, you'll need to adjust --num-executors, --executor-memory and numPartitions to find values that allow the entire data set to be loaded into memory. You can start with a small data set, a single executor and a small amount of executor memory on your local machine, then analyze the logs while incrementally increasing executor memory to see at which configuration the data is fully cached in memory. Once you have the configuration for the small data set, you can do the math to figure out how many executors, with how much memory, are required, and what the number of partitions should be for the desired partition size.
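The extrapolation step can be sketched as simple arithmetic. Everything in the example is an assumption made for illustration: the sample measurements, the 8 GB of executor memory, the 128 MB target partition size, and above all `storage_fraction`. The share of executor memory actually available for caching depends on your Spark version and memory settings, so measure it rather than trusting a default.

```python
import math

def estimate_cluster(sample_input_gb, sample_cached_gb, full_input_gb,
                     executor_memory_gb, storage_fraction=0.5,
                     target_partition_mb=128):
    """Extrapolate from a small cached sample to the full data set.

    storage_fraction is a hypothetical share of executor memory usable for caching.
    """
    # How much larger the data is in memory than on disk, measured on the sample.
    expansion = sample_cached_gb / sample_input_gb
    full_cached_gb = full_input_gb * expansion
    # Executors needed so that the whole data set fits in cache.
    usable_per_executor_gb = executor_memory_gb * storage_fraction
    num_executors = math.ceil(full_cached_gb / usable_per_executor_gb)
    # Partitions needed to keep each cached partition near the target size.
    num_partitions = math.ceil(full_cached_gb * 1024 / target_partition_mb)
    return full_cached_gb, num_executors, num_partitions

# Hypothetical measurements: a 1 GB sample occupies 3 GB once cached,
# the full corpus is 10 GB on disk, and each executor gets 8 GB.
estimate = estimate_cluster(1, 3, 10, 8)
print(estimate)
```

Treat the result as a first guess, then confirm it in the executor logs by checking that the "Unable to store" messages no longer appear.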
I had faced a similar problem and managed to bring down model training time from around 4 hours to 20 minutes by following the above steps.
