Cross validation fails in Spark-ML - multithreading

I am running a Spark ML job with a decision tree and a cross-validation around it.
It fails for an unknown reason during the cross-validation, with this stack trace:
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164)
org.apache.spark.ml.tuning.CrossValidator$$anonfun$4$$anonfun$6.apply(CrossValidator.scala:164)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:164)
org.apache.spark.ml.tuning.CrossValidator$$anonfun$4.apply(CrossValidator.scala:144)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:144)
decisionTree.DecisionTreeDisplay.process(DecisionTreeDisplay.scala:151)
Followed by some thread stack traces:
2019-01-23 16:26:21 ERROR TaskSchedulerImpl:91 - Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$3@764726a7 rejected from java.util.concurrent.ThreadPoolExecutor@783b07b9[Shutting down, pool size = 2, active threads = 2, queued tasks = 0, completed tasks = 4914]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:413)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:394)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:67)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
My cross validation code is:
// define Cross-Validation
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
.setSeed(seed)
.setCollectSubModels(true) // requires version of spark >= 2.3.0
.setParallelism(8) // requires version of spark >= 2.3.0
val cvModel = cv.fit(trainInfile) //Fail here
In the ML library, it seems to fail at this line:
val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
Any idea?
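For context, the pool reported as "Shutting down" in the RejectedExecutionException is the task-result getter, so the scheduler was already stopping while the cross-validation futures were still being awaited. One way to narrow the problem down is to rerun the fit with the parallel fitting and sub-model collection turned off; a minimal sketch, reusing the same pipeline, evaluator, param grid and training data (cvSequential and cvModelSeq are just illustrative names):
// same cross-validator, but fitting the folds one at a time and without keeping sub-models
val cvSequential = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setSeed(seed)
  .setParallelism(1) // the default: sequential fitting
val cvModelSeq = cvSequential.fit(trainInfile)
If the sequential run completes, the failure is tied to the concurrent fitting (or to something stopping the SparkContext while the eight parallel fits are in flight) rather than to the pipeline itself.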

Related

Spark: OOM on left join (inner is fine). How do I disable broadcast for a specific dataframe?

I have two dataframes with exactly the same join keys.
Spark config:
spark_config = {
    'spark.driver.memory': '1g',
    'spark.driver.maxResultSize': '1g',
    'spark.executor.memory': '1g',
    'spark.executor.memoryOverhead': '1g',
    'spark.kryoserializer.buffer.max': 1024,
    'spark.executor.instances': 2,
    'spark.executor.cores': 1,
    ...
}
The inner join is fine:
df1 = spark.read.parquet('/data/pipeline-workspace/91_active/3d2e21c0-3d0f-11ec-a384-b9dfba307662/segment')
df2 = spark.read.parquet('/data/pipeline-workspace/91_active/b0fe56e6-01d8-4e18-a8a8-d53e4e363fe8/daily_merged2')
df3 = df1.drop('__uuid__', '__timestamp__').join(df2, on="__label_id__", how='inner')
df3.write.parquet('/data/test', mode='overwrite')
Counts:
>>> df1.count()
102301
>>> df2.count()
102301
>>> df3.count()
102301
But the left join hits an OOM:
df3 = df1.drop('__uuid__', '__timestamp__').join(df2, on="__label_id__", how='left')
df3.write.parquet('/data/test', mode='overwrite')
logs:
Py4JJavaError: An error occurred while calling o374.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:586)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Setting spark.sql.autoBroadcastJoinThreshold=-1 does solve the OOM, but it is not the behavior we want: we still need small tables to be auto-broadcast, so we shouldn't disable it globally.
I tried df2.collect() and got an OOM too, so the question becomes:
why does Spark think the size of df2 is lower than 10 MB (the default autoBroadcastJoinThreshold)?
Could I mark df2 with some kind of "no auto broadcast" tag, so that I don't need to set spark.sql.autoBroadcastJoinThreshold=-1?
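If only this one query should avoid the broadcast, the threshold can at least be toggled around it rather than changed for the whole application. A rough sketch (in Scala, whereas the question uses PySpark; spark, df1, df2 and the output path are taken from the snippets above, previousThreshold and joined are illustrative names):
// remember the current threshold so normal behaviour can be restored afterwards
val previousThreshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// disable automatic broadcast only around the problematic left join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val joined = df1.drop("__uuid__", "__timestamp__")
  .join(df2, Seq("__label_id__"), "left")
joined.write.mode("overwrite").parquet("/data/test")

// re-enable auto-broadcast for the rest of the session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", previousThreshold)
As of Spark 3.0 there are also per-join strategy hints (MERGE, SHUFFLE_HASH, SHUFFLE_REPLICATE_NL) that steer the planner away from a broadcast for one specific join, which is closer to the per-dataframe tagging asked about here.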

Error with SANSA: unable to create RDD from NT file

Unable to create an RDD[Triple] using
sparkSession.rdf(Lang.NTRIPLES)(path)
This used to work without issue with Java 11 and Spark 2.4.x.
It is not working, and throws the error below, when using Java 8 and Spark 3.0:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 28499
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:532)
at com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:315)
at com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:102)
at com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:76)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:292)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:292)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:289)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:285)
at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
at scala.collection.immutable.List.flatMap(List.scala:366)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:21)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:29)
at com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:77)
at com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:308)
at com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:196)
at com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueAccessor(BasicBeanDescription.java:252)
at com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:346)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:216)
at com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:165)
at com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1388)
at com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1336)
at com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:510)
at com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:713)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:308)
at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094)
at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404)
at org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:145)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:751)
at org.apache.spark.SparkContext.makeRDD(SparkContext.scala:855)
at com.xx.yy.catalog._CatalogDataBuilder.fromTriples(CatalogDataBuilder.scala:433)
***
***
at com.xx.yy.example.TestExample.main(TestExample.scala)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 28499
I also had a java.lang.ArrayIndexOutOfBoundsException: 28499, very similar, after migrating to Spark 3.0.1 from 2.4.3, when performing a count, countApprox or rdd operation on Spark datasets.
For me, this solution worked:
https://programmersought.com/article/35311239379/
Basically I added this dependency:
<dependency>
    <groupId>com.thoughtworks.paranamer</groupId>
    <artifactId>paranamer</artifactId>
    <version>2.8</version>
</dependency>
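If the project is built with sbt rather than Maven, the equivalent (same coordinates as the Maven snippet above) would presumably be:
// direct dependency, mirroring the Maven snippet; sbt's eviction rules will pick 2.8
// over the older paranamer pulled in transitively (dependencyOverrides would also work)
libraryDependencies += "com.thoughtworks.paranamer" % "paranamer" % "2.8"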

How to load Collection data types using spark-cassandra connector in batch mode

I am trying to load a Spark dataframe, which has two attributes with collection data types, into a Cassandra table.
In the incoming feed file, these attributes are text/String. I used the code below to convert the String type to List and Map types respectively:
spark.udf.register("getLst", (input: String) => input.split(",").toList)
spark.udf.register("getMap", (input:String) => parse(input).values.asInstanceOf[Map[String, String]])
val ofr_data_final=spark.sql("""select
...
getLst(acct_nb_ls) as acct_nb_ls,
getMap(brw_eci_and_sts_mp) as brw_eci_and_sts_mp,
.....""")
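For reference, the getMap registration above relies on a parse call whose import is not shown; a self-contained version of that registration, assuming parse comes from json4s (an assumption, since the original snippet does not say):
import org.json4s._                          // assumed: provides the implicit String -> JsonInput conversion
import org.json4s.jackson.JsonMethods.parse  // assumed source of `parse` in the snippet above

// JSON object string -> Map, cast exactly as in the original registration
spark.udf.register("getMap", (input: String) =>
  parse(input).values.asInstanceOf[Map[String, String]])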
printSchema on the Spark dataframe shows those two attributes as follows:
|-- acct_nb_ls: array (nullable = true)
| |-- element: string (containsNull = true)
|-- brw_eci_and_sts_mp: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
In Cassandra, those two attributes are defined as shown below:
acct_nb_ls FROZEN<LIST<text>>,
brw_eci_and_sts_mp FROZEN<MAP<text, text>>,
Here is my load statement:
ofr_data_final.rdd.saveToCassandra(Config.keySpace,offerTable, writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
However, the load fails with the error below:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 140 in stage 24.0 failed 4 times, most recent failure: Lost task 140.3 in stage 24.0 (TID 1741, bdtcstr70n12.svr.us.jpmchase.net, executor 9): java.io.IOException: Failed to write statements to mars_offerdetails.offer_detail_2.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:167)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:37)
at com.jpmc.mars.LoadOfferData$.delayedEndpoint$com$jpmc$mars$LoadOfferData$1(LoadOfferData.scala:246)
at com.jpmc.mars.LoadOfferData$delayedInit$body.apply(LoadOfferData.scala:22)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.jpmc.mars.LoadOfferData$.main(LoadOfferData.scala:22)
at com.jpmc.mars.LoadOfferData.main(LoadOfferData.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Failed to write statements to mars_offerdetails.offer_detail_2.
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:167)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I suspect the issue might be that the attribute acct_nb_ls is inferred as 'array' and not as 'list', but I am not sure how to make Spark infer it as 'list' instead of 'array'. In my UDF I explicitly used
input.split(",").toList
but it is still being inferred as an array.
Loading collection data types using the spark-cassandra connector in batch mode worked as expected with the per-row TTL option via rdd.saveToCassandra. The issue was with the data: it was old and contained already-expired dates, which generated negative TTL values, and hence the load failed.
The Spark error message should be enhanced to make that clear.
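Given that explanation, a simple guard before the save would keep the expired rows out of the batch. A sketch (the column name "ttl" comes from the TTLOption.perRow call above; writableRows is an illustrative name, and the filter condition is only an assumption about how the TTL column is populated):
import org.apache.spark.sql.functions.col
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

// drop records whose per-row TTL has already gone non-positive (i.e. expired data)
val writableRows = ofr_data_final.filter(col("ttl") > 0)

writableRows.rdd.saveToCassandra(
  Config.keySpace,
  offerTable,
  writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))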

Is it possible to create broadcast variables within a Spark Streaming transformation function?

I tried to create a recoverable Spark Streaming job with some arguments fetched from a database. But then I hit a problem: it always gives me a serialization error when I try to restart the job from a checkpoint.
18/10/18 09:54:33 ERROR Executor: Exception in task 1.0 in stage 56.0 (TID 132)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to scala.collection.MapLike
at com.ptnj.streaming.alertJob.InputDataParser$.kafka_stream_handle(InputDataParser.scala:37)
at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
at com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1.apply(InstanceAlertJob.scala:38)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I followed the advice from maxime G in this existing SO question, and it seems to help.
But now there is another exception. Because of that issue, I have to create broadcast variables during the stream transformation, like
val kafka_data_streaming = stream.map(x => DstreamHandle.kafka_stream_handle(url, x.value(), sc))
So I end up having to pass the SparkContext as a parameter into the transformation function, and then this occurs:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:546)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:545)
at com.ptnj.streaming.alertJob.InstanceAlertJob$.streaming_main(InstanceAlertJob.scala:38)
at com.ptnj.streaming.AlarmMain$.create_ssc(AlarmMain.scala:36)
at com.ptnj.streaming.AlarmMain$.main(AlarmMain.scala:14)
at com.ptnj.streaming.AlarmMain.main(AlarmMain.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@5fb7183b)
- field (class: com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, name: sc$1, type: class org.apache.spark.SparkContext)
- object (class com.ptnj.streaming.alertJob.InstanceAlertJob$$anonfun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 14 more
I have never seen this situation before. Every example shows broadcast variables being created in an output operation function, not in a transformation function, so is that even possible?
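For context on the pattern those examples follow: the Spark Streaming guide describes a lazily instantiated singleton for broadcast variables used with checkpointing, looked up from inside an output operation (where the SparkContext is reachable via rdd.sparkContext) instead of being captured by a transformation closure. A rough sketch, with illustrative names (JobArgsBroadcast and loadArgsFromDb are hypothetical):
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// hypothetical loader for the job arguments; the real job reads them from the database
def loadArgsFromDb(): Map[String, String] = Map("threshold" -> "0.8")

// lazily instantiated singleton holding the broadcast, so it can be re-created
// after a restart from the checkpoint instead of being restored from it
object JobArgsBroadcast {
  @volatile private var instance: Broadcast[Map[String, String]] = _

  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(loadArgsFromDb())
        }
      }
    }
    instance
  }
}

// look the broadcast up inside an output operation, where rdd.sparkContext is available,
// instead of passing `sc` into a map closure
stream.foreachRDD { rdd =>
  val args = JobArgsBroadcast.getInstance(rdd.sparkContext)
  rdd.foreach { record =>
    val params = args.value // only the (serializable) broadcast handle is shipped, not the SparkContext
    // ... process `record` with `params` here
  }
}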

Spark: StackOverflowError when trying to convert a Java RDD to a DataFrame

I am using Spark 2.0 and trying to convert a Java RDD to a DataFrame.
Here is the code I am using; my bean has nested beans.
JavaRDD<XXX> mappedRDD = hbaseRDD.map(s -> {
    // final long serialVersionUID = -2021713021648730786L;
    XXX xx = new XXX();
    return xx;
});
Dataset<Row> df = sparkSession.createDataFrame(mappedRDD, XXX.class);
I am getting the following error:
Exception in thread "main" java.lang.StackOverflowError
at sun.reflect.generics.repository.GenericDeclRepository.getTypeParameters(GenericDeclRepository.java:84)
at java.lang.Class.getTypeParameters(Class.java:715)
at org.spark_project.guava.reflect.Types$ParameterizedTypeImpl.<init>(Types.java:288)
at org.spark_project.guava.reflect.Types.newParameterizedType(Types.java:98)
at org.spark_project.guava.reflect.TypeToken.toGenericType(TypeToken.java:917)
at org.spark_project.guava.reflect.TypeToken.getSupertype(TypeToken.java:401)
at org.apache.spark.sql.catalyst.JavaTypeInference$.elementType(JavaTypeInference.scala:132)
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:101)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:117)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:115)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:115)
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:101)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:117)
at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:115)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

Resources