LinearRegressionWithSGD doesn't converge on files larger than 11 MB - apache-spark

I'm using Spark 1.6.1 with Scala 2.11.7 on Ubuntu 14.04, with the following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G".
My data is organized in 20 JSON-like files, each about 8-15 MB, containing categorical and numerical values. I parse this data using the DataFrame facilities, then scale one numerical feature and create dummy variables for the categorical features.
From the initial 14 keys of each JSON-like file I end up with about 200-240 features in the final LabeledPoint. The final data is sparse, and each file contains about 20,000-30,000 observations.
I try to run two algorithms on the data, LinearRegressionWithSGD and LassoWithSGD, since the data is sparse and regularization might be required.
For data larger than 11 MB, LinearRegressionWithSGD fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in stage 346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 179307 ms.
I first hit this problem with an 11 MB file (with a 5 MB file the algorithm works fine). After trying a lot of debugging options (testing different values for driver.memory and executor.memory, making sure the cache is cleared properly, proper use of coalesce()), I found that setting the step size of gradient descent to 1 resolves the issue (while for the 5 MB file a step size of 0.4 does not fail and gives better results).
So I tried increasing the step size for the 12 MB file (to 1.5 and 2), but it didn't work. If I take only 10 MB of the file instead of the whole file, the algorithm doesn't fail.
This is very frustrating, since I need to build the model on the whole file, which still seems far from Big Data scale.
If I cannot run linear regression on 12 MB, can I run it on larger sets? I noticed that StandardScaler at the preprocessing step and the counts at the linear-regression step call collect(), which may be the cause of the problem. This puts the scalability of linear regression into question because, as far as I understand it, collect() runs on the driver, so the point of distributed computation is lost.
The following parameters are set:
val algorithme = new LinearRegressionWithSGD() //LassoWithSGD()
algorithme.setIntercept(true)
algorithme.optimizer
.setNumIterations(100)
.setStepSize(1)
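For context, here is a minimal sketch of the kind of pipeline described above (the RDD name data and the explicit StandardScaler step are illustrative assumptions, not the exact project code):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint] built from the parsed JSON-like files (assumed to exist).
def trainModel(data: RDD[LabeledPoint]) = {
  // withMean = false keeps the sparse feature vectors sparse.
  val scaler = new StandardScaler(withMean = false, withStd = true)
    .fit(data.map(_.features))
  val scaled = data
    .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
    .cache()

  val algorithme = new LinearRegressionWithSGD() // or LassoWithSGD()
  algorithme.setIntercept(true)
  algorithme.optimizer
    .setNumIterations(100)
    .setStepSize(1.0)
  algorithme.run(scaled)
}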

Related

Spark breaks when you need to make a very large shuffle

I'm working with 1 terabyte of data, and at one point I need to join two smaller DataFrames. I don't know their exact size, but it is more than 200 GB, and I get the error below.
The failure occurs in the middle of the operation, after about 2 hours.
It seems to me to be a memory issue, but that doesn't make sense, because looking at the Spark Ganglia UI, RAM usage doesn't reach the limit.
Does anyone have any idea how I can solve this without decreasing the amount of data analyzed?
My cluster has:
1 x master node n1-highmem-32
4 x slave node n1-highmem-32
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 482.1 failed 4 times, most recent failure: Lost task 3.3 in stage 482.1 (TID 119785, 10.0.101.141, executor 1): java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
Caused by: java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
This type of error typically occurs when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:
sqlContext.setConf("spark.sql.shuffle.partitions", 2048)
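Before blindly raising the partition count, it can help to confirm the skew. A minimal sketch (in Scala; the DataFrame and the join-key column name are placeholders) that shows the heaviest keys:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, desc}

// Print the 20 most frequent values of the join key; a handful of keys with
// millions of rows each is a strong sign of skew.
def showKeySkew(df: DataFrame, keyCol: String): Unit =
  df.groupBy(keyCol)
    .agg(count("*").as("rows"))
    .orderBy(desc("rows"))
    .show(20)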

PySpark: Job aborts due to stage failure, but resetting max size isn't recognized

I'm attempting to display a DataFrame in PySpark after reading the files in using a function/subroutine. Reading the files in works fine, but it's the display that's failing. Actually, due to lazy evaluation, that may not be entirely true.
I get this error
SparkException: Job aborted due to stage failure: Total size of serialized results of 29381 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
so I do what was suggested here: https://forums.databricks.com/questions/66/how-do-i-work-around-this-error-when-using-rddcoll.html
sqlContext.setConf("spark.driver.maxResultSize", "8g")
sqlContext.getConf("spark.driver.maxResultSize")
However, the bizarre part is that I get the same error back when I re-run the display(df) command.
It's like Spark is just ignoring my commands.
I've tried increasing the number of workers and making both the worker type and driver type larger, but neither of these fixed anything.
How can I get this to work? Or is this a bug in Databricks/Spark?
It all depends on your code and how the data is partitioned relative to the cluster size. Increasing spark.driver.maxResultSize is the first option to work around the problem; eventually look for a permanent fix by modifying the code or design. And please avoid collecting large amounts of data to the driver node.
OR
You need to change this parameter in the cluster configuration. Go into the cluster settings, under Advanced select Spark, and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended; you should optimize the job by repartitioning.
For more details, refer to "Spark Configuration - Application Properties".
Hope this helps. Do let us know if you have any further queries.
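One likely reason the runtime setConf appears to be ignored is that spark.driver.maxResultSize is a Spark core property read from the driver's SparkConf, so it generally needs to be in place before the SparkContext starts (for example in the cluster's Spark config field or on spark-submit). A minimal sketch, shown in Scala for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Set the limit before the SparkContext exists; calling sqlContext.setConf at
// runtime only updates the SQL conf, not the core setting the scheduler checks.
val conf = new SparkConf().set("spark.driver.maxResultSize", "8g")

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()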

How to overcome the Spark spark.kryoserializer.buffer.max 2g limit?

I am reading a CSV with 600 records using Spark 2.4.2. The last 100 records contain large data.
I am running into the following problem:
ERROR Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 0.0 (TID 5, 10.244.5.133, executor 3):
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 47094.
To avoid this, increase spark.kryoserializer.buffer.max value.
I have increased spark.kryoserializer.buffer.max to 2g (the maximum allowed setting) and the Spark driver memory to 1g, and was able to process a few more records, but I still cannot process all the records in the CSV.
I have tried paging through the 600 records: for example, with 6 partitions I can process 100 records per partition, but since the last 100 records are huge, the buffer overflow still occurs.
In this case the last 100 records are large, but it could just as well be the first 100, or records 300 to 400. Unless I sample the data beforehand to get an idea of the skew, I cannot optimize the processing approach.
Is there a reason why spark.kryoserializer.buffer.max is not allowed to go beyond 2g?
Maybe I can increase the number of partitions and decrease the records read per partition? Is it possible to use compression?
Appreciate any thoughts.
Kryo buffers are backed by byte arrays, and primitive arrays can only be up to 2GB in size.
Please refer to the below link for further details.
https://github.com/apache/spark/commit/49d2ec63eccec8a3a78b15b583c36f84310fc6f0
Please increase the number of partitions, since you cannot optimize the processing approach.
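As a rough sketch of that suggestion, one way to read the file into more, smaller input partitions is to lower the input split size (shown in Scala; the path and the 8 MB value are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Smaller input splits mean more partitions, so fewer records are serialized per task.
spark.conf.set("spark.sql.files.maxPartitionBytes", (8 * 1024 * 1024).toString)

val df = spark.read
  .option("header", "true")
  .csv("/path/to/records.csv")

println(df.rdd.getNumPartitions)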
What do you have in those records such that a single one blows the Kryo buffer?
In general, leaving the partitions at the default of 200 should be a good starting point. Don't reduce it to 6.
It looks like a single record (line) blows the limit.
There are a number of options for reading in the CSV data; you can try the csv reader options.
If a single line translates into a 2GB buffer overflow, I would think about parsing the file differently.
The csv reader also ignores/skips some text in files (no serialization) if you give it a schema.
If you remove the huge columns from the schema, it may read in the data easily.
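Along the lines of that last suggestion, here is a sketch of supplying a slimmed-down schema to the csv reader so the huge column is never deserialized (all column names and the path are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// Keep only the small columns; the huge payload column is deliberately left out.
val slimSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))

val df = spark.read
  .option("header", "true")
  .schema(slimSchema)
  .csv("/path/to/records.csv")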

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o106.save. :
java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 259, localhost, executor driver):
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you have to use repartition(1), since you want the data on a single partition before writing it out.
Why would coalesce not work?
Spark limits shuffling during coalesce, so you cannot perform a full shuffle (across different workers) when using coalesce, whereas you can with repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)

Spark LDA woes - prediction and OOM questions

I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA.
Starting small, following the Java examples, I built a 100K doc / 600K feature / 250 topic / 100 iteration model using the Distributed model with the EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809, which I cherry-picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents (skeleton code). The resulting predictions were slow to generate (which I offered a fix for in SPARK-10809) but, more worrisome, incoherent (topics/predictions). If a document is predominantly about football, I'd expect the "football" topic (topic 18) to be in the top 10.
Not being able to tell whether something is wrong in my prediction code, or whether it's because I was using the Distributed/EM-based model (as hinted at by jasonl here), I decided to try the newer Local/Online model. I spent a couple of days tuning my 240-core / 768 GB RAM 3-node cluster to no avail; seemingly no matter what I try, I run out of memory attempting to build a model this way.
I tried various settings for:
driver-memory (8G)
executor-memory (1-225G)
spark.driver.maxResultSize (including disabling it)
spark.memory.offHeap.enabled (true/false)
spark.broadcast.blockSize (currently at 8m)
spark.rdd.compress (currently true)
changing the serializer (currently Kryo) and its max buffer (512m)
increasing various timeouts to allow for longer computation (executor.heartbeatInterval, rpc.ask/lookupTimeout, spark.network.timeout)
spark.akka.frameSize (1024)
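(For reference, a sketch of where knobs like these are set programmatically; the values below are illustrative placeholders, not the exact values I tried.)

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative placeholders mirroring the list above.
val conf = new SparkConf()
  .setAppName("lda-online")
  .set("spark.driver.maxResultSize", "0")                 // 0 disables the limit
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "16g")                // required when off-heap is enabled
  .set("spark.broadcast.blockSize", "8m")
  .set("spark.rdd.compress", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m")
  .set("spark.executor.heartbeatInterval", "120s")
  .set("spark.network.timeout", "800s")
  .set("spark.akka.frameSize", "1024")

val sc = new SparkContext(conf)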
At different settings, it seems to oscillate between a JVM core dump due to off-heap allocation errors (Native memory allocation (mmap) failed to map X bytes for committing reserved memory) and java.lang.OutOfMemoryError: Java heap space. I see references to models being built near my order of magnitude (databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), so I must be doing something wrong.
Questions:
Does my prediction routine look OK? Is this an off-by-one error somewhere w.r.t. the irrelevant predicted topics?
Do I stand a chance of building a model with Spark on the order of magnitude described above? Yahoo can do it with modest RAM requirements.
Any pointers as to what I can try next would be much appreciated!
