How to train word2vec model efficiently in the spark cluster environment? - apache-spark

I want to train word2vec model about 10G news corpus on my Spark cluster.
The following is the configration of my spark cluster:
One Master and 4 Worker
each with 80G memory and 24 Cores
However I find training Word2vec using Spark Mllib does't take full advantage of the cluster's resource.
For example:
the pic of top command in ubuntu
As the above picture shows,only 100% cpu is used in a worker,the other three worker is not in use(so not paste the their picture) and Just now I how trained a word2vec model about 2G news corpus,It takes about 6h,So I want to know how to train the model more efficiently?Thank everyone in advance:)
UPDATE1:the following command is what I used in the spark-shell
how to start spark-shell
spark-shell \
--master spark://ip:7077 \
--executor-memory 70G \
--driver-memory 70G \
--conf spark.akka.frameSize=2000 \
--conf spark.driver.maxResultSize=0 \
--conf spark.default.parallelism=180
the following command is what I used to train word2vec model in the spark-shell:
//import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
//read about 10G newsdata corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*",600).map(line => line.split(" ").toSeq)
//Configure word2vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)
//train the model
val model = word2vec.fit(newsdata)
UPDATE2:
I have train the model for about 24h and it doesn't complete. The cluster is running like this:
only 100% cpu is used in a worker,the other three worker is not in use as before.

I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for word2vec here, it reads:
setNumIterations(numIterations) Sets number of iterations
(default: 1), which should be smaller than or equal to number of
partitions.
New in version 1.2.0.
setNumPartitions(numPartitions)Sets number of partitions
(default: 1). Use a small number for accuracy.
New in version 1.2.0.
My word2vec model stopped hanging, and Spark stopped running out of memory when I increased the number of partitions used by the model so that numIterations <= numPartitions
I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10).

As your model is taking too long to train, I think you should first try and understand how spark actually benefits the model training part. As per this paper,
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty
Spark mllib's libraries remove this performance penalty by caching the data in memory during the first iteration. So subsequent iterations are extremely quick compared to the first iteration and hence, there is a significant reduction in model training time. I think, in your case, the executor memory might be insufficient to load a partition of data in memory. Hence contents would be spilled to disk and would need to be fetched from disk again in every iteration, thus killing any performance benefits of spark. To make sure, this is actually the case, you should try and look at the executor logs which would contain some lines like "Unable to store rdd_x_y in memory".
If this is indeed the case, you'll need to adjust --num-executors, --executor-memory and numPartitions to see which values of these parameters are able to load the entire data into memory. You can try out with a small data set, single executor and a small value of executor memory on your local machine and analyze logs while incrementally increasing executor memory to see at which config the data is totally cached in memory. Once you have the configs for the small data set, you can do the Maths to figure out how many executors with how much memory are required and what should be the number of partitions for the required partition size.
I had faced a similar problem and managed to bring down model training time from around 4 hours to 20 minutes by following the above steps.

Related

PySpark PandasUDF on GCP - Memory Allocation

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:
#pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
#train model on grp_df
#evaluate model
#return metrics on
return (metrics)
result=df.groupBy('group_id').apply(test_train)
This works fine except when I use the non-sampled data, where errors are returned that appear to be related to memory issues. The messages are cryptic (to me) but if I sample down the data it runs, if I dont, it fails. Error messages are things like:
OSError: Read out of bounds (offset = 631044336, size = 69873416) in
file of size 573373864
or
Container killed by YARN for exceeding memory limits. 24.5 GB of 24
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My Question is how to set memory in the cluster to get this to work?
I understand that each group of data and the process being ran needs to fit entirely in the memory of the executor. I current have a 4-worker cluster with the following:
If I think the maximum size of data in the largest group_id requires 150GB of memory, it seems I really need each machine to operate on one group_id at a time. At least I get 4 times the speed compared to having a single worker or VM.
If I do the following, is this in fact creating 1 executor per machine that has access to all the cores minus 1 and 180 GB of memory? So that if in theory the largest group of data would work on a single VM with this much RAM, this process should work?
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '180g') \
.config('spark.executor.cores', '63') \
.config('spark.executor.instances', '1') \
.getOrCreate()
Let's break the answer into 3 parts:
Number of executors
The GroupBy operation
Your executor memory
Number of executors
Straight from the Spark docs:
spark.executor.instances
Initial number of executors to run if dynamic allocation is enabled.
If `--num-executors` (or `spark.executor.instances`) is set and larger
than this value, it will be used as the initial number of executors.
So, No. You only get a single executor which won't scale up unless dynamic allocation is enabled.
You can increase the number of such executors manually by configuring spark.executor.instances or setup automatic scale up based on workload, by enabling dynamic executor allocation.
To enable dynamic allocation, you have to also enable the shuffle service which allows you to safely remove executors. This can be done by setting two configs:
spark.shuffle.service.enabled to true. Default is false.
spark.dynamicAllocation.enabled to true. Default is false.
GroupBy
I have observed group_by being done using hash aggregates in Spark which means given x number of partitions, and unique group_by values greater than x, multiple group by values will lie in the same partition.
For example, say two unique values in group_by column are a1 and a2 having total rows' size 100GiB and 150GiB respectively.
If they fall into separate partitions, your application will run fine since each partition will fit into the executor memory (180GiB), which is required for in-memory processing and the remaining will be spilled to disk if they do not fit into the remaining memory. However, if they fall into same partition, your partition will not fit into the executor memory (180GiB < 250GiB) and you will get an OOM.
In such instances, it's useful to configure spark.default.parallelism to distribute your data over a reasonably larger number of partitions or apply salting or other techniques to remove data skewness.
If your data is not too skewed, you are correct to say that as long as your executor can handle the largest groupby value, it should work since your data will be evenly partitioned and chances of the above happening will be rare.
Another point to note is that since you are using group_by which requires data shuffle, you should also turn on the shuffle service. Without the shuffle service, each executor has to serve the shuffle requests along with doing it's own work.
Executor memory
The total executor memory (actual executor container size) in Spark is determined by adding the executor memory alloted for container along with the alloted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,
Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)
Based on this, you can configure your executors to have an appropriate size as per your data.
So, when you set the spark.executor.memory to 180GiB, the actual executor launched should be of around 198GiB.
To Resolve yarn overhead issue you can increase yarn overhead memory by adding .config('spark.yarn.executor.memoryOverhead','30g') and for maximum parallelism it is recommended to keep no of cores to 5 where as you can increase the no of executors.
spark = SparkSession.builder \
.appName('test') \
.config('spark.executor.memory', '18g') \
.config('spark.executor.cores', '5') \
.config('spark.executor.instances', '12') \
.getOrCreate()
# or use dynamic resource allocation refer below config
spark = SparkSession.builder \
.appName('test') \
.config('spark.shuffle.service.enabled':'true')\
.config('spark.dynamicAllocation.enabled':'true')\
.getOrCreate()
I solved OSError: Read out of bounds ****
by making group number large
result=df.groupBy('group_id').apply(test_train)

How to speed up training Word2vec model with spark?

I am Using spark Word2vec API to build word vector. The code:
val w2v = new Word2Vec()
.setInputCol("words")
.setOutputCol("features")
.setMinCount(5)
But, this process is so slow. I check spark monitor web, there was two jobs to run long time.
My computer environment have 24 cores CPU and 100G memory, how to use them efficiently?
I would try increasing the amount of partitions in the dataframe that you are doing the feature extraction on. the stragglers are likely due to skew in the data causing most of the data to be processed by one node or core. If possible, distribute the data by logical partitioning, if not then create a random even distribution.

Spark: No space left on device when training tree-based model

I have a emr cluster with one master node and four worker nodes, each with 4 cores, 16gb RAM and four 80gb Hard Disks. My problem is when I tried to train a tree-based model such as Decision Tree, Random Forest or Gradient Boosting, the training process cannot be done and Yarn/Spark complained out of disk space error.
Here is the code that I tried to run:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
rfr = DecisionTreeRegressor(featuresCol="features", labelCol="age")
cvModel = rfr.fit(age_training_data)
The process always stuck at MapPartitionsRDD map at BaggedPoint.scala step. I tried to run df -h command to examine the disk space. Before the process started, only 1 or 2% or space was used. But when I started the training process, 100% was used for at two hard disks. This problem occurs in not just Decision Tree, but also Random Forest and Gradient Boosting too. I was able to run Logistic Regression on the same dataset without any problem.
Highly appreciated if anyone knows the possible cause of this. Please take a look at the image of the stage which the error happens. Thank you!
UPDATE:Here is the setting of the spark-defaults.conf:
spark.history.fs.cleaner.enabled true
spark.eventLog.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 8g
spark.yarn.driver.memoryOverhead 7g
spark.driver.cores 3
spark.executor.memory 10g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 100000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 100000000s
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
As for the data, it has around 120000000 records and the estimated size of the rdd is 1925055624 bytes (which I do not know if it is accurate as I follow here to get the estimated size).
I also found out the space is mostly used up by the temporary data blocks that are written to the machine during the training process, which seems to happen only when I used the tree-based model.

Spark cluster does not scale to small data

i am currently evaluating Spark 2.1.0 on a small cluster (3 Nodes with 32 CPUs and 128 GB Ram) with a benchmark in linear regression (Spark ML). I only measured the time for the parameter calculation (not including start, data loading, …) and recognized the following behavior. For small datatsets 0.1 Mio – 3 Mio datapoints the measured time is not really increasing and stays at about 40 seconds. Only with larger datasets like 300 Mio datapoints the processing time went up to 200 seconds. So it seems, the cluster does not scale at all to small datasets.
I also compared the small dataset on my local pc with the cluster using only 10 worker and 16GB ram. The processing time of the cluster is larger by a factor of 3. So is this considered normal behavior of SPARK and explainable by communication overhead or am I doing something wrong (or is linear regression not really representative)?
The cluster is a standalone cluster (without Yarn or Mesos) and the benchmarks where submitted with 90 worker, each with 1 core and 4 GB ram.
Spark submit:
./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData
The optimum cluster size and configuration varies based on the data and the nature of the job. In this case, I think that your intuition is correct, the job seems to take disproportionately longer to complete on smaller dataset, because of the excess overhead given the size of the cluster (cores and executors).
Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.
Spark is a great tool for processing lots of data, but it isn't going to be competitive with running a single process on a single machine if the data will fit. However it can be much faster than other distributed processing tools that are disk-based, where the data does not fit on a single machine.
I was at a talk a couple years back and the speaker gave an analogy that Spark is like a locomotive racing a bicycle:- the bike will win if the load is light, it is quicker to accelerate and more agile, but with a heavy load the locomotive might take a while to get up to speed, but it's going to be faster in the end. (I'm afraid I forget the speakers name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector).
I agree with #ImDarrenG's assessment and generally also the locomotive/bicycle analogy.
With such a small amount of data, I would strongly recommend
A) caching the entire dataset and
B) broadcasting the dataset to each node (especially if you need to do something like your 300M row table join to the small datasets)
Another thing to consider is the # of files (if you're not already cached), because if you're reading in a single unsplittable file, only 1 core will be able to read that file in. However once you cache the dataset (coalescing or repartitioning as appropriate), performance will no longer be bound by disk/serializing the rows.

Tuning model fits in Spark ML

I'm fitting a large number of models in Pyspark via Spark ML (see: How best to fit many Spark ML models) and I'm wondering what I can do to speed up individual fits.
My data set is a spark data frame that's approximately 50gb, read in from libsvm format, and I'm running on a dynamically allocated YARN cluster with allocated executor memory = 10gb. Fitting a logistic regression classifier, it creates about 30 steps of treeAggregate at LogisticRegression.scala:1018, with alternating shuffle reads and shuffle writes of ~340mb each.
Executors come and go but it seems like the typical stage runtime is about 5 seconds. Is there anything I can look at to improve performance on these fits?
As a general job in Spark, you can do some stuff to improve your training time.
spark.driver.memory look out for your driver memory, some algorithms do shuffle data to your driver (in order to reduce computing time), so it might be a source of enhancement or at least one point of failure to keep an eye at.
Change the spark.executor.memory so it uses the maximum needed by the job but it also uses as little as much so you can fit more executors in each node (machine) on the cluster, and as you have more workers, you'll have more computer power to handle the job.
spark.sql.shuffle.partitions since you probably use DataFrames to manipulate data, try different values on this parameter so that you can execute more tasks per executor.
spark.executor.cores use it below 5 and you're good, above that, you probably will increase the time an executor has to handle the "shuffle" of tasks inside of it.
cache/persist: try to persist your data before huge transformations, if you're afraid of your executors not being able to handle it use StorageLevel.DISK_AND_MEMORY, so you're able to use both.
Important: all of this is based on my own experience alone training algorithms using Spark ML over datasets with 1TB-5TB and 30-50 features, I've researched to improve my own jobs but I'm not qualified as a source of truth for your problem. Learn more about your data and watch the logs of your executors for further enhancements.

Resources