I am following the Spark tutorial from the Watson Studio Gallery on IBM Cloud (https://eu-de.dataplatform.cloud.ibm.com/exchange/public/entry/view/99b857815e69353c04d95daefb3b91fa?context=cpdaas) and ran into a Java stack overflow problem:
Py4JJavaError: An error occurred while calling o20418.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.lang.StackOverflowError
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:516)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
The problem line:
cvModel = crossval.fit(trainingRatings)
The problem cell:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator  # needed for the evaluator below; als itself is defined in an earlier cell of the tutorial
(trainingRatings, validationRatings) = ratings.randomSplit([80.0, 20.0])
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
paramGrid = ParamGridBuilder().addGrid(als.rank, [1, 5, 10]).addGrid(als.maxIter, [20]).addGrid(als.regParam, [0.05, 0.1, 0.5]).build()
crossval = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=10)
cvModel = crossval.fit(trainingRatings)
predictions = cvModel.transform(validationRatings)
print('The root mean squared error for our model is: {}'.format(evaluator.evaluate(predictions.na.drop())))
Environment used: Default Spark 3.2 & Python 3.9
I will be grateful for any help.
I resolved the problem by adding more memory to the VM hosting the notebook.
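For reference, a minimal sketch (assuming the SparkSession can be recreated in the notebook; the values are guesses, not settings from the tutorial) of raising the driver memory and the JVM thread stack size, which is what the serialization stack trace points at:
from pyspark.sql import SparkSession
# Hypothetical sizes; in client mode spark.driver.memory only takes effect if the driver JVM has not started yet.
spark = (SparkSession.builder
         .appName('als-crossval')
         .config('spark.driver.memory', '8g')
         .config('spark.driver.extraJavaOptions', '-Xss16m')
         .config('spark.executor.extraJavaOptions', '-Xss16m')
         .getOrCreate())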
Related
I'm trying to create a DataFrame from two custom sentences, just as a test, but the code I wrote fails to create it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('first').getOrCreate()
df = spark.createDataFrame(
[
(0, "Hi this is a Spark tutorial"),
(1, "This tutorial is made in Python language")
], ['id', 'sentence']
)
df.show()
This gives me this error:
Py4JJavaError: An error occurred while calling o73.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2) (executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
I tried to create a schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType(
[StructField("id", IntegerType(), True),
StructField("sentence", StringType(), True)]
)
and pass it as an argument (schema=schema), but it hits the same dead end.
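For what it is worth, the "Python worker failed to connect back" error usually means the executors cannot launch a Python worker process at all, so the schema is unlikely to be the culprit. A minimal sketch of a commonly suggested workaround (assuming the driver and the workers should share the same interpreter; not guaranteed to apply here) is to pin the worker interpreter before creating the session:
import os
import sys
# Point Spark's Python workers at the interpreter this notebook runs on.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('first').getOrCreate()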
My code was working fine earlier, but now it is showing a cache-related error. My program loads a DataFrame, transforms it and processes it, running in a Jupyter notebook connected to the PySpark shell.
I do not understand what the main issue is or how to tackle it. Any help is highly appreciated.
My code is:
import time
start = time.time()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('spark://172.16.12.200:7077').appName('new').getOrCreate()
ndf = spark.read.json("Musical_Instruments.json")
pd=ndf.select(ndf['asin'],ndf['overall'],ndf['reviewerID'])
spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit,ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in list(set(pd.columns)-set(['overall'])) ]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)
(training,test)=transformed.randomSplit([0.8, 0.2])
als=ALS(maxIter=5,regParam=0.09,rank=25,userCol="reviewerID_index",itemCol="asin_index",ratingCol="overall",coldStartStrategy="drop",nonnegative=True)
model=als.fit(training)
evaluator=RegressionEvaluator(metricName="rmse",labelCol="overall",predictionCol="prediction")
predictions=model.transform(test)
rmse=evaluator.evaluate(predictions)
print("RMSE="+str(rmse))
print("Rank: ",model.rank)
print("MaxIter: ",model._java_obj.parent().getMaxIter())
print("RegParam: ",model._java_obj.parent().getRegParam())
user_recs=model.recommendForAllUsers(10).show(20)
end = time.time()
print("execution time",end-start)
The error is:
Py4JJavaError: An error occurred while calling o40.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 172.16.12.208, executor 1): java.io.FileNotFoundException: File file:/home/npproject/jupyter_files /Musical_Instruments.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
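Note that the failing path contains a stray space (jupyter_files /Musical_Instruments.json) and uses file: storage, which each executor resolves on its own machine, so on a standalone cluster the file has to exist at that exact path on every worker. A hedged sketch of the alternative, reading from storage all nodes can see (the HDFS path below is only an example):
# Hypothetical shared location; any filesystem visible to every executor works.
ndf = spark.read.json("hdfs:///data/Musical_Instruments.json")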
I am struggling to get my code running on Zeppelin on EMR (emr-5.10.0, Zeppelin 0.7.3, Spark 2.2.0).
The code is straightforward: fitting a CrossValidator with a RandomForestClassifier on a training DataFrame of 400K samples (~40K positives and ~360K negatives).
When I run a simple training (say 100 trees with 15 max depth) everything goes OK, but when I run a heavier test with more values in the ParamGridBuilder I get an org.apache.thrift.transport.TTransportException and I do not know how to trace the cause of that error.
I am using a cluster of three c3.8xlarge machines with the following settings of Spark interpreter on Zeppelin:
spark.executor.memory = 15g
spark.yarn.executor.memoryOverhead = 2048
spark.executor.cores = 10
I played with spark.memory.fraction without success, and I also tried to change the number of executors by adjusting the three settings above, again without success.
I have a feeling it is rather a Zeppelin problem, but I am not able to trace the cause of the exception. I looked into the logs and found no exception other than the TTransportException, which is not helpful by itself.
Any help or hint on how to trace the root cause of the exception is highly appreciated.
Here is the code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors
val genreIndexer = new StringIndexer()
.setInputCol("genre")
.setOutputCol("genreIndex")
.setHandleInvalid("skip")
val genreEncoder = new OneHotEncoder()
.setInputCol(genreIndexer.getOutputCol)
.setOutputCol("genreVec")
val featuresAssembler = new VectorAssembler()
.setInputCols(Array("hourOfDay", "dayOfWeek_number", "dayOfMonth", "genreVec"))
.setOutputCol("features")
val classifier = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
val paramGrid = new ParamGridBuilder()
.addGrid(classifier.numTrees, Array(200, 400))
.addGrid(classifier.maxDepth, Array(10, 20))
.build()
val pipeline = new Pipeline().setStages(Array(genreIndexer, genreEncoder, featuresAssembler, classifier))
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
val cvModel = cv.fit(train_df)
Here is the exception as I see it in the logs and on Zeppelin:
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:373)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:97)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:406)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
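One avenue that may be worth checking (a guess based on the symptoms, not a verified fix): in Zeppelin a bare TTransportException usually just means the remote interpreter process died, often because the driver ran out of memory while aggregating cross-validation results, so raising the driver-side limits alongside the executor settings listed above could help, e.g.:
spark.driver.memory = 15g
spark.driver.maxResultSize = 4g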
I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0:
val paramGrid = new ParamGridBuilder()
.addGrid(lr.regParam, Array(0.1, 0.01))
.addGrid(lr.fitIntercept)
.addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
.build()
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case.
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(...)
.withColumn("price", $"price".cast("double"))