Could not initialize class with mapGroupsWithState - apache-spark

I'm trying to create a Spark Structured Streaming application with arbitrary state. When I add groupByKey and mapGroupsWithState, it gives me an error after the first task starts.
.groupByKey(_.user_id)
.mapGroupsWithState(GroupStateTimeout.NoTimeout)(sessionState.updateAcrossEvents)
Error:
Lost task 5.0 in stage 1.0 (TID 5, node, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.test.Run$
Run is the main class which contains the main method.
Can someone tell me why this is happening?
PS: if I write directly to the console without groupByKey and mapGroupsWithState, it works fine.
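No answer was posted for this one, but a frequent cause of this particular NoClassDefFoundError is that the state-update function and the case classes it uses live inside the Run object that holds main, so the executors try (and fail) to initialize the whole main object. Below is a minimal sketch, not the original code: Event, SessionInfo and the body of updateAcrossEvents are assumptions; only user_id, sessionState and the call pattern come from the question. Keeping the stateful logic in a small standalone object is one way to avoid dragging the main object onto the executors.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical input and state types; the real ones are not shown in the question.
case class Event(user_id: String, value: Long)
case class SessionInfo(user_id: String, eventCount: Long)

// A standalone object (outside the Run object that extends App / defines main),
// so executors never need to initialize com.test.Run$.
object sessionState extends Serializable {
  def updateAcrossEvents(userId: String,
                         events: Iterator[Event],
                         state: GroupState[SessionInfo]): SessionInfo = {
    val previous = state.getOption.getOrElse(SessionInfo(userId, 0L))
    val updated  = previous.copy(eventCount = previous.eventCount + events.size)
    state.update(updated)
    updated
  }
}

// Usage, matching the snippet above:
// events.groupByKey(_.user_id)
//       .mapGroupsWithState(GroupStateTimeout.NoTimeout)(sessionState.updateAcrossEvents)

If the logic has to stay in Run, the executor logs usually contain an earlier ExceptionInInitializerError that shows which field of Run failed to initialize, since "Could not initialize class" means a previous initialization attempt already threw.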

Related

Creating dynamic frame issue without the pushdown predicate

New to AWS Glue, so pardon my question:
Why do I get an error when I don't include a pushdown predicate when creating the dynamic frame? I am trying to use it without the predicate because I will be using bookmarks, so only new files are processed regardless of the date partition.
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF")
datasourceDyF.toDF().show(20)
vs
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF", push_down_predicate="salesdate = '2020-01-01'")
datasourceDyF.toDF().show(20)
The first snippet gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 1.0 (TID 4, xxx.xx.xxx.xx, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
A pushdown predicate is actually good to use when connecting to an RDBMS or reading a table: it helps Spark identify which data needs to be loaded into memory (there is no point in loading data that is not required by the downstream system). The benefit is that, with less data, execution is much faster than a full table load.
Now, in your case, the underlying table is probably partitioned, hence the pushdown predicate was required.
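As a side note, the same partition-pruning idea can be illustrated with plain Spark rather than Glue. The sketch below is only an analogy under assumptions (the path and layout are made up): when data is laid out by partition column, filtering on that column makes Spark scan only the matching directories, which is what push_down_predicate achieves for a Glue catalog table.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()

// Hypothetical layout: s3://my-bucket/sales/salesdate=2020-01-01/part-*.parquet, etc.
val sales = spark.read.parquet("s3://my-bucket/sales/")

// Filtering on the partition column prunes partitions, so only the
// salesdate=2020-01-01 directory is scanned instead of the whole table.
sales.where("salesdate = '2020-01-01'").show(20)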

Spark Job succeeds even with failures

I ran a Spark job that takes inputs from two sources, something like:
/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}
The problem is that these two structures have different schemas. The Spark job's final status is successful, but it does not process that data because of the mismatch. And because of the successful status, the issue went unnoticed in our clusters for a while.
Is there a way to make the Spark job fail instead of finishing successfully?
Here is a snippet of the error in the task log for reference:
Job aborted due to stage failure: Task 1429 in stage 2.0 failed 4 times, most recent failure: Lost task 1429.3 in stage 2.0 (TID 1120, 1.mx.if.aaa.com, executor 64): java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
Sample of the code I ran:
val ckall = spark.read.parquet("/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}")
ckall.write.parquet("/home/hadoop/output")
Ideally, I expect the final status of the Spark job to be a failure.
I had a similar issue, only to find out it was all my fault.
Basically, my app starting point looked like this:
import org.apache.spark.sql.SparkSession
import org.slf4j.LoggerFactory

object MyApp extends App {
  private val logger = LoggerFactory.getLogger(getClass)
  logger.info(s"Starting $BuildInfo")

  val spark: SparkSession = SparkSession.builder.appName("name").getOrCreate()
  processing(spark)
  spark.stop()
}
And all seems fine. But processing(spark) was actually wrapped in a Try, and it did not return Unit but Try[Unit]. Everything executed fine inside, but if an error occurred, it was caught inside and never propagated.
I simply stopped catching the errors and now the app fails like a charm :-).
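A minimal sketch of the difference, with a placeholder processing function (the body is illustrative, not the original code): if the Try[Unit] is discarded, failures are swallowed and the driver exits cleanly; rethrowing the failure (or not wrapping the call in Try at all) lets the driver exit with an error, so the job status is reported as failed.

import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession

object MyApp extends App {
  val spark: SparkSession = SparkSession.builder.appName("name").getOrCreate()

  // Placeholder for the real transformation logic.
  def processing(spark: SparkSession): Unit =
    spark.read.parquet("/home/hadoop/base/hourly/input1/20190701/").count()

  // Try(processing(spark))           // swallows failures: the app still exits with status 0
  Try(processing(spark)) match {       // propagate instead
    case Success(_) =>
      spark.stop()
    case Failure(e) =>
      spark.stop()
      throw e                          // driver exits non-zero, job is marked failed
  }
}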

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files with simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you have to use repartition(1), since you want the data to be on a single partition before writing it out.
Why would coalesce not work?
Spark limits shuffling during coalesce: you cannot perform a full shuffle (across different workers) when using coalesce, whereas you can when using repartition, although that is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
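A small sketch of the mechanical difference (in Scala, but the same applies to the PySpark call above; the local master and sizes are illustrative, not from the question): both calls end with one partition, but repartition(1) inserts a shuffle boundary, so the upstream work keeps its parallelism, while coalesce(1) collapses the upstream stage onto a single task.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("coalesce-vs-repartition").getOrCreate()

val df = spark.range(0L, 1000000L, 1L, 8)         // 8 input partitions

println(df.coalesce(1).rdd.getNumPartitions)       // 1, no shuffle: upstream work runs as one task
println(df.repartition(1).rdd.getNumPartitions)    // 1, with a shuffle before the final stage

// The plans show the difference: repartition adds an Exchange, coalesce does not.
df.coalesce(1).explain()
df.repartition(1).explain()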

Why is the status of a task inconsistent between the logs and the Spark web UI?

I did the following operations on an RDD with 4 partitions in the DStream's foreachRDD function of my Spark Streaming application:
print rdd.count()
print rdd.collect()
The first statement, rdd.count(), executes normally, while the second statement is always stuck in the RUNNING status, as the following picture shows:
However, when I take a look at the log, it shows that the task has finished.
18/11/09 16:45:30 INFO executor.Executor: Finished task 3.0 in stage 26.0 (TID 555). 197621638 bytes result sent via BlockManager)
What's the problem?
The Spark version is pyspark==2.2.1, and the cluster runs Spark on YARN.

WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor)

I have two integration tests for my DataFrame transformation code (using https://github.com/holdenk/spark-testing-base ), and they all run fine when run individually in IntelliJ.
However, when I run my Gradle build, I see the following messages for the first test:
17/04/06 11:29:02 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
And:
17/04/06 11:29:05 ERROR SparkContext: Error initializing SparkContext.
akka.actor.InvalidActorNameException: actor name [ExecutorEndpoint] is not unique!
And:
java.lang.NullPointerException
at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
The second test runs partway and aborts with the following message (this code runs fine on the actual cluster BTW):
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NullPointerException
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
Here's a pastebin of the full build output: https://pastebin.com/drG20kcB
How do I run my spark integration tests all together?
Thanks!
PS: In case it's relevant, I'm using the Gradle wrapper (./gradlew clean build).
I needed this:
test {
    maxParallelForks = 1
}
However, if there is a way to turn off parallel execution for a specific subset of tests in Gradle, I would much prefer that solution.
I'm using ScalaTest with WordSpec BTW.
