Spark Job succeeds even with failures - apache-spark

I ran a Spark job that takes inputs from two sources, something like:
/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}
The problem is that these two structures have different schemas. The job's final status is successful, but because of this issue it does not actually process the data. Because of the successful status, the problem went unnoticed on our clusters for a while.
Is there a way to make the Spark job fail instead of bailing out successfully?
Here is a snippet of the error from the task log for reference:
Job aborted due to stage failure: Task 1429 in stage 2.0 failed 4 times, most recent failure: Lost task 1429.3 in stage 2.0 (TID 1120, 1.mx.if.aaa.com, executor 64): java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
Sample of the code I ran:
val ckall = spark.read.parquet("/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}")
ckall.write.parquet("/home/hadoop/output")
Ideally, I expect the final status of the Spark job to be a failure.

I had a similar issue only to find out it was all my fault.
Basically, my app starting point looked like this:
object MyApp extends App {
  private val logger = LoggerFactory.getLogger(getClass)
  logger.info(s"Starting $BuildInfo")
  val spark: SparkSession = SparkSession.builder.appName("name").getOrCreate()
  processing(spark)
  spark.stop()
}
And all seemed fine. But processing(spark) was actually wrapped in a Try, so it did not return Unit but Try[Unit]. Everything executed fine inside, but if an error occurred, it was caught and never propagated.
I simply stopped catching the errors and now the app fails like a charm :-).
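To illustrate the pattern, here is a minimal sketch of how a Try can swallow the failure and leave the job marked as successful, and one way to propagate it (processing here is a hypothetical stand-in, not the original code):
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession

object MyApp extends App {
  val spark: SparkSession = SparkSession.builder.appName("name").getOrCreate()

  // Hypothetical processing step; stands in for whatever the job actually does.
  def processing(spark: SparkSession): Unit =
    spark.read.parquet("/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}").count()

  // The problematic pattern: the Try captures the exception and nothing rethrows it,
  // so the driver still exits with code 0 and the job is reported as successful.
  val result: Try[Unit] = Try(processing(spark))

  // One way to propagate the failure: inspect the Try and rethrow after cleanup.
  result match {
    case Success(_)  => spark.stop()
    case Failure(ex) => spark.stop(); throw ex // non-zero exit, so the job is marked failed
  }
}
If you do not need the Try at all, calling processing(spark) directly (as in the snippet above) lets the exception bubble up and fail the application.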

Related

mapGroupsWithState throwing error Caused by: java.lang.NoClassDefFoundError: Could not initialize

I'm trying to read a CSV, derive an event state using mapGroupsWithState, and write it to Kafka. The code below works if I comment out the mapGroupsWithState piece. I am using Spark version 2.3.1.
val event = spark.read.option("header", "true").csv(path)
val eventSession = imsi.orderBy("event_timestamp")
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(updateAcrossEvents)
eventSession.toJSON.write.format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("topic", outputTopic)
  .save()
Error:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 85 in stage 11.0 failed 8 times, most recent failure: Lost task 85.7 in stage 11.0 (TID 53, XXX, executor 2): java.lang.NoClassDefFoundError: Could not initialize class xxxx$
at xxx.imsiProcessor$$anonfun$run$1$$anonfun$3.apply(xx.scala:86)
at xxx.imsiProcessor$$anonfun$run$1$$anonfun$3.apply(xx.scala:86)
at org.apache.spark.sql.KeyValueGroupedDataset$$anonfun$3.apply(KeyValueGroupedDataset.scala:279)
at org.apache.spark.sql.KeyValueGroupedDataset$$anonfun$3.apply(KeyValueGroupedDataset.scala:279)
at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$12.apply(objects.scala:361)
at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$12.apply(objects.scala:360)
at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$10$$anonfun$apply$4.apply(objects.scala:337)
at org.apache.spark.sql.execution.MapGroupsExec$$anonfun$10$$anonfun$apply$4.apply(objects.scala:336)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:367)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
at com.telstra.elbrus.core.imsiProcessor$.spark$lzycompute(ImsiProcessor.scala:38)
I was able to get the code running by getting rid of a few extends; the bare-bones code started running.
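The "Caused by: A master URL must be set" frame at imsiProcessor$.spark$lzycompute hints at what those extends were pulling in: if the object that defines the closures also initializes a SparkSession (directly or via an inherited trait), the executor JVM has to initialize that object to run the closure, the SparkSession builder runs there without a master URL and throws, and the class then surfaces as "NoClassDefFoundError: Could not initialize class ...$". A minimal sketch of that anti-pattern and one way around it (the names are hypothetical, not a reconstruction of the asker's code):
import org.apache.spark.sql.SparkSession

// Anti-pattern sketch: the object's constructor builds a SparkSession. If a function
// defined here is shipped to executors, the executor has to initialize this object,
// getOrCreate() runs without a master URL and throws, and later references report
// "NoClassDefFoundError: Could not initialize class ImsiProcessorBroken$".
object ImsiProcessorBroken {
  val spark: SparkSession = SparkSession.builder.appName("imsi").getOrCreate()

  def extractKey(line: String): String = line.split(",")(0)
}

// Workaround sketch: build the SparkSession inside the driver-side entry point and keep
// the functions that run on executors free of references to driver-only state.
object ImsiProcessorFixed {
  private def extractKey(line: String): String = line.split(",")(0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("imsi").getOrCreate()
    import spark.implicits._
    spark.read.textFile(args(0)).map(extractKey).show()
    spark.stop()
  }
}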

Could not initialize class with mapGroupsWithState

I'm trying to create a Spark Structured Streaming application with arbitrary state. When I add groupByKey and mapGroupsWithState, it gives me an error after the first task starts.
.groupByKey(_.user_id)
.mapGroupsWithState(GroupStateTimeout.NoTimeout)(sessionState.updateAcrossEvents)
Error:
Lost task 5.0 in stage 1.0 (TID 5, node, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.test.Run$
Run is the main class which contains the main method.
Can someone tell me why this is happening?
PS: If I write directly to the console without groupByKey and mapGroupsWithState, it works fine.

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of the concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing this?
So there are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you have to use repartition(1), since you want the data on a single partition before writing it out.
Why would coalesce not work?
Spark limits shuffling during coalesce: you cannot perform a full shuffle (across different workers) when using coalesce, whereas you can with repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
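The same idea in Scala, for comparison, as a minimal sketch (the logging DataFrame and output path here are hypothetical; the snippet above is the PySpark form):
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("logging-writer").getOrCreate()
import spark.implicits._

// Hypothetical logging data; stands in for dfLogging in the question.
val dfLogging: DataFrame = Seq(("2019-07-01", "job started")).toDF("date", "message")

// repartition(1) performs a full shuffle, so all rows end up in one partition
// before the write, no matter which workers produced them.
dfLogging
  .repartition(1)
  .write
  .format("delta")
  .mode("append")
  .save("/dbfs/delta/Logging") // hypothetical path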

Spark streaming from Kafka error causing data loss

I have a Spark Streaming application written in Python that collects data from Kafka and stores it on the file system. When I run it, I see plenty of "holes" in the collected data. After analyzing the logs, I realized that 285000 out of 302000 jobs failed, all with the same exception:
Job aborted due to stage failure: Task 4 in stage 604348.0 failed 1 times,
most recent failure: Lost task 4.0 in stage 604348.0 (TID 2097738, localhost):
kafka.common.OffsetOutOfRangeException
at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
I know that this exception happens when trying to access a non-existent offset in Kafka. My Kafka topic has a 1-hour retention, and I think that somehow my jobs get stuck for more than an hour, and after being released the data is no longer available in the Kafka queue. I couldn't reproduce this issue at a small scale, even with a very short retention, and I wonder whether the jobs can really get stuck and released as I assumed (and how I can avoid it), or whether I need to look in a completely different direction.

WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor)

I have two integration tests for my DataFrame transformation code (using https://github.com/holdenk/spark-testing-base), and they both run fine when run individually in IntelliJ.
However, when I run my gradle build, for the first test I see the following messages:
17/04/06 11:29:02 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
And:
17/04/06 11:29:05 ERROR SparkContext: Error initializing SparkContext.
akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not unique!
And:
java.lang.NullPointerException
at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
The second test runs partway and aborts with the following message (this code runs fine on the actual cluster BTW):
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
Here's a pastebin of the full build output: https://pastebin.com/drG20kcB
How do I run my spark integration tests all together?
Thanks!
PS: If it might be relevant, I'm using gradle wrapper (./gradlew clean build)
I needed this:
test {
  maxParallelForks = 1
}
However, if there is a way to turn off parallel execution for a specific subset of tests in Gradle, I would much prefer that solution.
I'm using ScalaTest with WordSpec BTW.
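On the spark-testing-base side, the library's SharedSparkContext trait manages the SparkContext lifecycle for each suite (creating it before the tests and stopping it afterwards), which keeps individual suites from building and tearing down contexts by hand. A minimal sketch of such a suite, assuming that trait and a hypothetical transformation (simplified to an RDD example; it does not by itself replace the maxParallelForks = 1 setting above):
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.WordSpec

class TransformationSpec extends WordSpec with SharedSparkContext {
  // sc is provided by SharedSparkContext; the trait creates it in beforeAll
  // and stops it in afterAll.

  "the transformation" should {
    "keep only even numbers" in { // hypothetical example
      val result = sc.parallelize(Seq(1, 2, 3, 4)).filter(_ % 2 == 0).collect()
      assert(result.sorted.sameElements(Array(2, 4)))
    }
  }
}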
