spark task size too big - apache-spark

I'm using LBFGS logistic regression to classify examples into one of two categories. When I'm training the model, I get many warnings of this kind:
WARN scheduler.TaskSetManager: Stage 132 contains a task of very large size (109 KB). The maximum recommended task size is 100 KB.
WARN scheduler.TaskSetManager: Stage 134 contains a task of very large size (102 KB). The maximum recommended task size is 100 KB.
WARN scheduler.TaskSetManager: Stage 136 contains a task of very large size (109 KB). The maximum recommended task size is 100 KB.
I have about 94 features and about 7500 training examples. Is there some other argument I should pass in order to break up the task size into smaller chunks?
Also, is this just a warning that, in the worst case, can be ignored? Or does it hamper the training?
I'm calling my trainer this way:
val lr_lbfgs = new LogisticRegressionWithLBFGS().setNumClasses(2)
lr_lbfgs.optimizer.setRegParam(reg).setNumIterations(numIterations)
val model = lr_lbfgs.run(trainingData)
Also, my driver and executor memory are both 20G, which I set as arguments to spark-submit.

Spark sends a copy of every variable and method that needs to be visible to the executors; this warning means that, in total, these objects exceed 100 KB. You can safely ignore this warning if it doesn't impact performance noticeably, or you could consider marking some variables as broadcast variables.
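For illustration, a minimal sketch of the broadcast-variable approach (the lookup map here is hypothetical; only trainingData comes from the question):
// Hypothetical large object that would otherwise be captured by the task
// closure and serialized into every single task.
val featureNames: Map[Int, String] = (0 until 94).map(i => i -> ("feature_" + i)).toMap
// broadcast() ships one read-only copy per executor instead of one per task.
val featureNamesBc = sc.broadcast(featureNames)   // sc is your existing SparkContext
val enriched = trainingData.map { point =>
  val names = featureNamesBc.value                // read the broadcast value inside the task
  // ... use names as needed ...
  point
}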

Related

What contributes to spark driver maxResultSize limits?

In my Spark job, the results I am sending to the driver are barely a few KB. I still got the exception below in spite of spark.driver.maxResultSize being set to 4 GB:
ERROR TaskSetManager: Total size of serialized results of 3021102 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
Do Spark accumulators or anything else contribute to the memory accounted for by spark.driver.maxResultSize? Is there official documentation or code I can refer to in order to learn more about this?
More details about the code/execution:
There are 3 million tasks
Each task reads 50 files from S3 and re-writes them back to S3 post-transformation
Tasks return the prefix of the S3 files along with some metadata, which is collected at the driver for saving to a file. This data is < 50 MB
This issue has been fixed here: the cause is that when Spark calculates the result size, it also counts the metadata (like metrics) in the task binary result sent back to the driver. Therefore, if you have a huge number of tasks but collect almost nothing (in terms of real data), you can still hit the error.
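As a rough illustration of why that happens with the numbers above (a back-of-envelope sketch, not an exact accounting of Spark's bookkeeping):
// Illustrative arithmetic using the figures from the error message above.
// Even if each task returns almost no real data, the per-task metadata
// (accumulator updates, metrics) sent back to the driver adds up.
val numTasks   = 3021102L
val limitBytes = 4L * 1024 * 1024 * 1024            // spark.driver.maxResultSize = 4 GB
val perTask    = limitBytes.toDouble / numTasks     // roughly 1.4 KB per task is enough
println(f"about ${perTask}%.0f bytes of result and metadata per task hits the limit")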

Performance issue with pyspark job

I am using pyspark / spark sql to perform very simple tasks. The data size is small, the largest source being 215 MB; 90% of the data sources are smaller than 15 MB. We do filtering, crunching and data aggregations, and the resulting data is also less than 5 MB for 90% of the sources. Only 2 results are 120 MB and 260 MB.
The main hot spot is the coalesce(1) operation, as we have a requirement to produce only one file. I can understand the 120 MB and 260 MB gzipped files taking time to generate and write, but generating and writing a file of less than 5 MB should be fast. When I monitor the job I can see that a lot of time is taken by coalesce and saving the data file. I am clueless as to why it should take 60-70 seconds to generate and write a 2-3 MB file.
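For illustration, the kind of pipeline being described looks roughly like this (a sketch in Scala with placeholder source, columns and paths, not the actual job):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
val spark = SparkSession.builder().appName("single-file-report").getOrCreate()
// Placeholder pipeline: filter, aggregate, then force a single output file.
val result = spark.read.parquet("s3://bucket/input")
  .filter(col("status") === "ok")
  .groupBy(col("key"))
  .count()
// coalesce(1) funnels the entire write through one task on one core, so even
// a small result is generated, gzipped and written by a single thread.
result.coalesce(1)
  .write
  .option("compression", "gzip")
  .csv("s3://bucket/output")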
Configuration:
I have achieved some performance gain with fat executors of 3 vcores per executor. I am using a 1-master, 3-worker cluster with 4-core nodes.
Regards
Manish Zope

How to overcome the Spark spark.kryoserializer.buffer.max 2g limit?

I am reading a CSV with 600 records using Spark 2.4.2. The last 100 records contain large data.
I am running into the following problem:
ERROR Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 0.0 (TID 5, 10.244.5.133, executor 3):
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 47094.
To avoid this, increase spark.kryoserializer.buffer.max value.
I have increased spark.kryoserializer.buffer.max to 2g (the maximum allowed setting) and the Spark driver memory to 1g, and was able to process a few more records, but I still cannot process all the records in the CSV.
I have tried paging the 600 records, e.g. with 6 partitions I can process 100 records per partition, but since the last 100 records are huge the buffer overflow still occurs.
In this case the last 100 records are large, but it could just as well be the first 100, or the records between 300 and 400. Unless I sample the data beforehand to get an idea of the skew, I cannot optimize the processing approach.
Is there a reason why spark.kryoserializer.buffer.max is not allowed to go beyond 2g?
Maybe I can increase the partitions and decrease the records read per partition? Is it possible to use compression?
Appreciate any thoughts.
Kryo buffers are backed by byte arrays, and primitive arrays can only be up to 2 GB in size. Please refer to the link below for further details.
https://github.com/apache/spark/commit/49d2ec63eccec8a3a78b15b583c36f84310fc6f0
Please increase the partition number, since you cannot optimize the processing approach.
What do you have in those records such that a single one blows the Kryo buffer?
In general, leaving the partitions at the default of 200 should always be a good starting point. Don't reduce it to 6.
It looks like a single record (line) blows the limit.
There are a number of options for reading in the CSV data; you can try the csv options.
If there is a single line that translates into a 2 GB buffer overflow, I would think about parsing the file differently.
The csv reader also ignores/skips some text in the file (no serialization) if you give it a schema.
If you remove some of the columns that are so huge from the schema, it may read in the data easily.
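A rough sketch of that suggestion (path, schema and column names are placeholders; exact parsing behavior depends on the csv options you set):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType}
val spark = SparkSession.builder().appName("csv-slim-read").getOrCreate()
// Declare the full layout, then keep only the small columns so the huge
// field is dropped as early as possible.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("created_at", StringType),
  StructField("huge_payload", StringType)   // the oversized column in the file
))
val slim = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("s3://bucket/input.csv")
  .select("id", "created_at")               // drop the huge column early
  .repartition(200)                         // keep partitions near the default rather than 6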

Need For Large Executor Memory If Block size is 128 MB

I have a question regarding Spark. I am using Spark 2.2, and as per my knowledge each executor spins up tasks and executes them, with each task corresponding to a partition. The default number of partitions is based on the default parallelism and the file size / default block size. So consider a file size of 1 GB and a cluster of 4 executors, each of which can spin up 2 tasks (2 cores). As per that calculation, the executor memory should be about 256 MB (2 tasks, each operating on a 128 MB block) + 384 MB overhead. However, if I run the code with this as the executor memory, the performance is slow. If I give an executor memory of 1.5 GB (allowing for some calculations on the RDD), the performance is still slow. Only when I increase the executor memory to 3 GB is the performance good.
Can someone explain:
1. Why do we need so much executor memory when we work on only 128 MB of data at a time?
2. How do we calculate the optimum executor memory needed for the job?
Thanks for your help
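For reference, the back-of-envelope calculation from the question written out (a sketch only; the comment lists common extra costs, not a definitive sizing rule):
// Figures taken from the question above.
val blockMb      = 128                                   // one HDFS block per task/partition
val coresPerExec = 2                                     // two concurrent tasks per executor
val overheadMb   = 384                                   // default memory overhead mentioned above
val naiveMb      = coresPerExec * blockMb + overheadMb   // = 640 MB
// In practice the JVM also needs room for Spark's unified memory regions
// (execution and storage, sized by spark.memory.fraction), the objects your
// transformations create per record, and GC headroom, which is one reason a
// 640 MB to 1.5 GB executor can be slow while a ~3 GB executor performs well.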

LinearRegressionWithSGD doesn't converge on file of more than 11Mb

I'm using Spark 1.6.1 along with Scala 2.11.7 on Ubuntu 14.04, with the following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G".
My data is organized in 20 JSON-like files, each about 8-15 MB, containing categorical and numerical values. I parse this data using the DataFrame facilities, then scale one numerical feature and create dummy variables for the categorical features.
So far, from the initial 14 keys of my JSON-like files I get about 200-240 features in the final LabeledPoint. The final data is sparse, and every file contains about 20,000-30,000 observations.
I try to run two types of algorithm on the data: LinearRegressionWithSGD or LassoWithSGD, since the data is sparse and regularization might be required.
For data larger than 11 MB, LinearRegressionWithSGD fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in stage 346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 179307 ms.
I faced the same problem with an 11 MB file (for a 5 MB file the algorithm works well), and after trying a lot of debugging options (testing different values for driver.memory and executor.memory, making sure the cache is cleared properly, proper use of coalesce()), I found out that setting the StepSize of gradient descent to 1 resolves this failure (while for the 5 MB file a StepSize of 0.4 doesn't fail and gives better results).
So I tried to increase the StepSize for the 12 MB file (setting StepSize to 1.5 and 2), but it didn't work. If I take only 10 MB of the file instead of the whole file, the algorithm doesn't fail.
It's very frustrating, since I need to construct the model on the whole file, which still seems far from big-data scale.
If I cannot run linear regression on 12 MB, can I run it on larger sets? I noticed that the StandardScaler used in the preprocessing step and the counts in the linear regression step perform a collect(), which may cause the problem. So the ability to scale linear regression is in question because, as far as I understand it, collect() runs on the driver, and so the point of distributed computation is lost.
The following parameters are set:
val algorithme = new LinearRegressionWithSGD() //LassoWithSGD()
algorithme.setIntercept(true)
algorithme.optimizer
.setNumIterations(100)
.setStepSize(1)
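For context, a minimal sketch of the scaling step described above (assumes an existing RDD[LabeledPoint] named data; the variable names are illustrative):
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
// StandardScaler.fit aggregates column statistics back to the driver,
// which is the collect-like step mentioned in the question.
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(_.features))
val scaled = data
  .map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
  .cache()   // keep the training set in memory across SGD iterations
val model = algorithme.run(scaled)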
