Spark and SparklyR error "grows beyond 64 KB" - apache-spark

I am getting the following error from Spark after calling logistic regression via sparklyr on Spark 2.0.2:
ml_logistic_regression(Data, ml_formula)
The dataset I read into Spark is relatively large (2.2 GB). Here is the error message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task
13 in stage 64.0 failed 1 times, most recent failure:
Lost task 13.0 in stage 64.0 (TID 1132, localhost):
java.util.concurrent.ExecutionException:
java.lang.Exception:
failed to compile: org.codehaus.janino.JaninoRuntimeException:
Code of method "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z"
of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate"
grows beyond 64 KB
Others have had a similar issue (https://github.com/rstudio/sparklyr/issues/298), but I cannot find a resolution. Any ideas?

What happens when you subset the data and try running the model? You might need to change your configuration settings to deal with the size of the data:
library(dplyr)
library(sparklyr)
#configure the spark session and connect
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "XXG" #change depending on the size of the data
config$`sparklyr.shell.executor-memory` <- "XXG"
sc <- spark_connect(master = 'yarn-client', spark_home = '/XXXX/XXXX/XXXX', config = config)
These are just a couple of examples; there are other settings in spark_config() that you can tune for performance as well.

Related

Record larger than the Split size in AWS GLUE?

I'm a newbie with AWS Glue and Spark.
I'm building my ETL in it.
When I connect to my S3 bucket with files of approximately 200 MB, it does not read them.
The error is:
An error was encountered:
An error occurred while calling o99.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 16) (91ec547edca7 executor driver): com.amazonaws.services.glue.util.NonFatalException: Record larger than the Split size: 67108864
Update 1:
When I split my JSON file (200 MB) into two parts with jq, AWS Glue reads both parts normally.
My workaround is a Lambda function that splits the file, but I want to know how AWS Glue's splitting works.
Thanks and regards
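For illustration, a rough Python sketch of that kind of pre-splitting, assuming the file is a single top-level JSON array (the file names here are hypothetical):
import json

# Hypothetical file names; assumes the 200 MB input is one top-level JSON array.
with open("input.json") as f:
    records = json.load(f)

half = len(records) // 2
for i, chunk in enumerate([records[:half], records[half:]]):
    with open(f"part-{i}.json", "w") as out:
        json.dump(chunk, out)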

org.apache.spark.SparkException: Job aborted due to stage failure in pyspark

Sorry for the duplicate post; I'm creating another one because the earlier posts did not solve my problem.
I'm running ML regression on PySpark 3.0.1, on a cluster with 640 GB of memory and 32 worker nodes.
I have a dataset with 33,751 rows and 63 columns that I'm trying to prepare for ML regression, so I wrote the following code:
from pyspark.ml.feature import VectorAssembler, StandardScaler
input_col = [...]
vector_assembler = VectorAssembler(inputCols=input_col, outputCol='ss_feature')
temp_train = vector_assembler.transform(train)
standard_scaler = StandardScaler(inputCol='ss_feature', outputCol='scaled')
train = standard_scaler.fit(temp_train).transform(temp_train)
But I'm getting an error message when the last line executes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 169 in stage 57.0 failed 4
times, most recent failure: Lost task 169.3 in stage 57.0 (TID 5522, 10.8.64.22, executor 11):
org.apache.spark.SparkException: Failed to execute user defined
function(VectorAssembler$$Lambda$6296/1890764576:
Can you suggest how I can solve this issue?
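For reference, one commonly documented cause of this VectorAssembler failure is null or NaN values in the input columns. A minimal PySpark sketch that guards against that, assuming this is the cause here (the toy DataFrame standing in for train is hypothetical):
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy frame standing in for `train`; the second row has a null.
train = spark.createDataFrame(
    [(1.0, 2.0), (3.0, None), (5.0, 6.0)], ["x1", "x2"])
input_col = ["x1", "x2"]

# By default VectorAssembler raises "Failed to execute user defined function"
# when an input column contains null/NaN; handleInvalid='skip' drops such rows
# (alternatively, clean first with train.na.drop(subset=input_col)).
vector_assembler = VectorAssembler(
    inputCols=input_col, outputCol="ss_feature", handleInvalid="skip")
temp_train = vector_assembler.transform(train)

standard_scaler = StandardScaler(inputCol="ss_feature", outputCol="scaled")
train = standard_scaler.fit(temp_train).transform(temp_train)
train.show()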

Can't get the johnsnow OCR notebook to run on Databricks

So I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I still get stuck by the time I get to:
{ // for displaying
  val regions = data.select("region").collect().map(_.get(0))
  regions.foreach { chunk =>
    println("---------------")
    println(chunk)
  }
}
Error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI
Does anyone know why? Much appreciated!
To use Spark NLP OCR you need to install Tesseract 4.x+, as the documentation states, and in a cluster it must be present on all the nodes. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:
import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()
val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")
Update: There is a new doc for Spark OCR and special instructions for Databricks:
https://nlp.johnsnowlabs.com/docs/en/ocr

Spark (sparklyr) "too many open files" error when more cores are used

I am using the following configuration with sparklyr in local mode:
conf <- spark_config()
conf$`sparklyr.cores.local` <- 28
conf$`sparklyr.shell.driver-memory` <- "1000G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local",
version = "2.1.1",
config = conf)
This works fine when I read in a CSV using spark_read_csv. However, when I use more cores, such as
conf <- spark_config()
conf$`sparklyr.cores.local` <- 30
conf$`sparklyr.shell.driver-memory` <- "1000G"
conf$spark.memory.fraction <- 0.9
I get the following error:
Error in value[3L] :
Failed to fetch data: org.apache.spark.SparkException: Job aborted due
to stage failure: Task 10 in stage 3.0 failed 1 times, most recent
failure: Lost task 10.0 in stage 3.0 (TID 132, localhost, executor
driver): java.io.FileNotFoundException: /tmp/blockmgr-9ded7dfb-20b8-
4c72-8a6f-2db12ba884fb/1f/temp_shuffle_e69d56ba-80b4-499f-a91f-
0ae63fe4553f (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:152)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMa
I increased ulimit from 1040 to 419430 (both soft and hard) and this made no difference.
My VM has 128 cores and 2 TB of memory, and I'd like to be able to use all of it.
Any suggestions?
Spark local mode is intended for low-volume experimentation and unit testing, not for production usage, and problems with system limits are just the tip of the iceberg. It operates in a single JVM, and you can expect a lot of different issues just with memory management.
Overall, Spark is designed for scaling out, not scaling up. You shouldn't expect performance gains or painless operation just by increasing resources in local mode. Moreover, if the computational resources are not backed by a high-throughput disk configuration, they will be underutilized.

Read, sort and count a 20 GB CSV file stored in HDFS using PySpark RDD

I am new to Spark and Hadoop. I have a use case in which I am trying to read, count the records in, and sort a 20 GB CSV file. The problem is that when I use these functions it doesn't work. Here is my code; please have a look and suggest an approach for handling the large file with a Spark RDD.
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext

APP_NAME = 'My Spark Application'
conf = SparkConf().setAppName(APP_NAME).setMaster("local")
sc = SparkContext(conf=conf)
val_file = sc.textFile("hdfs://localhost:50000/yottaa/transactions.csv")
val_file.count()  # takes about 10 minutes to execute and produce a result
val_file.count() takes about 10 minutes to count the rows. How can I increase the speed? I'm using a laptop with 16 GB of RAM, and when I run val_file.collect() it shows me the following error:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
in stage 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: GC
overhead limit exceeded at
java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57) at
java.nio.CharBuffer.allocate(CharBuffer.java:331) at
java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777) at
org.apache.hadoop.io.Text.decode(Text.java:412) at
org.apache.hadoop.io.Text.decode(Text.java:389) at
org.apache.hadoop.io.Text.toString(Text.java:280) at
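For reference, a minimal PySpark sketch of counting and sorting without collect(), which is what tries to pull the whole 20 GB file into the 16 GB driver and triggers the GC overhead error; the choice of the first CSV column as the sort key is a hypothetical assumption:
from pyspark import SparkConf, SparkContext

# Use all local cores instead of the single thread that plain "local" gives.
conf = SparkConf().setAppName("csv-count-sort").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs://localhost:50000/yottaa/transactions.csv")

print(rdd.count())  # the count runs on the executors; only a number returns to the driver

# Sort by the first column (a hypothetical key) and keep the result distributed;
# inspect a few rows with take() instead of collect(), which would try to
# materialise the whole 20 GB file in driver memory.
sorted_rdd = rdd.map(lambda line: line.split(",")).sortBy(lambda cols: cols[0])
print(sorted_rdd.take(5))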
