Calling a simple count() on a Spark DataFrame fails - apache-spark

Cluster manager: YARN
Deploy mode: None
I was told that if the deploy mode is set to none, the stdout of the driver process goes to the root path instead of inside the container directory of the driver process.
Spark UI logs give the error: Container released on a lost node...
I have unpersisted all other dataframes/datasets before making this call to ensure they are not cached in memory.
Calling a simple action like count() keeps failing.
I am essentially doing the following:
columnNames.keys.foreach { col =>
  val nonNullColCount =
    dataset.select(dataset(col))
      .filter(row => row.getAs[Any](col) != null)
      .count()
  println(nonNullColCount)
}
So, I am calling count() on the dataset in a loop.
In each iteration, I select a column from a list of column names.
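For reference, the same per-column non-null counts can also be computed in a single aggregation job rather than one count() job per column; a minimal sketch, assuming the same dataset and columnNames as above:
import org.apache.spark.sql.functions.{col, count}
// count(column) ignores nulls, so this yields every column's non-null count in one pass.
val nonNullCounts = dataset.select(
  columnNames.keys.toSeq.map(c => count(col(c)).alias(c)): _*
)
nonNullCounts.show()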
The errors are generic and misleading, of the form:
Job aborted due to stage failure: Task 284 in stage 14.0 failed 4 times,
most recent failure: Lost task 284.3 in stage 14.0 (TID 100923, ip-172-31-50-226.ec2.internal, executor 266):
ExecutorLostFailure (executor 266 exited caused by one of the running tasks)
Reason: Container marked as failed: container_1506075842477_0672_01_017877 on host: ip-172-31-50-226.ec2.internal.
Exit status: -100.
Diagnostics: Container released on a *lost* node

If you are using AWS spot instances and a spot instance is taken back because of a price change, you can get the following error.
Exit status: -100. Diagnostics: Container released on a lost node
Workaround: split the Spark job into many independent steps, so you can save the result of each step as a file on S3 at short intervals, or go with non-spot instances.
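A rough sketch of that workaround in Scala (the bucket path, inputDf, and the stage1/stage2 functions here are made-up placeholders, not from the original job):
// Persist each intermediate result to S3 so that losing a spot node only costs one step.
val step1 = stage1(inputDf)
step1.write.mode("overwrite").parquet("s3://my-bucket/checkpoints/step1/")
val step2 = stage2(spark.read.parquet("s3://my-bucket/checkpoints/step1/"))
step2.write.mode("overwrite").parquet("s3://my-bucket/checkpoints/step2/")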

Related

Spark Job succeeds even with failures

I ran a spark job that takes inputs from two sources, something like:
/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}
The problem is that these two structures have different schemas. The situation I have is that the Spark job's final status is successful, but it does not process the data due to the issue. Because of the successful status, this issue went unnoticed in our clusters for a while.
Is there a way we can ask the Spark job to fail instead of bailing out successfully?
Here is a snippet of the error in the task log for reference
Job aborted due to stage failure: Task 1429 in stage 2.0 failed 4 times, most recent failure: Lost task 1429.3 in stage 2.0 (TID 1120, 1.mx.if.aaa.com, executor 64): java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
Sample of the code I ran:
val ckall = spark.read.parquet("/home/hadoop/base/hourly/{input1/20190701/,input2/20190701/}")
ckall.write.parquet("/home/hadoop/output")
Ideally, I expect the final status of the Spark job to be a failure.
I had a similar issue only to find out it was all my fault.
Basically, my app starting point looked like this:
import org.slf4j.LoggerFactory
import org.apache.spark.sql.SparkSession

object MyApp extends App {
  private val logger = LoggerFactory.getLogger(getClass)
  logger.info(s"Starting $BuildInfo")

  val spark: SparkSession = SparkSession.builder.appName("name").getOrCreate()
  processing(spark)
  spark.stop()
}
And all seemed fine. But processing(spark) was actually wrapped in a Try and did not return Unit but Try[Unit]. Everything executed fine inside, but if an error occurred, it was caught inside and not propagated.
I simply stopped catching the errors and now the app fails like a charm :-).
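For illustration, a minimal sketch of the pattern and the fix, reusing the processing and spark values from the snippet above (the match and the comments are my own additions):
import scala.util.{Failure, Success, Try}
// Before: Try swallows the exception, the driver exits with code 0,
// and YARN/Spark report the application as succeeded.
val result: Try[Unit] = Try(processing(spark))
// After: surface the failure so the final status reflects it.
result match {
  case Success(_) =>
    spark.stop()
  case Failure(ex) =>
    spark.stop()
    throw ex // non-zero exit => the job is marked as failed
}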

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them are going to write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing this?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you would have to use repartition(1), since you want the data to be on a single partition before writing it out.
Why would coalesce not work?
Spark limits the shuffling during a coalesce. So you cannot perform a full shuffle (across different workers) when you are using coalesce, whereas you can when you are using repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
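To illustrate the difference, a small Scala sketch (the behaviour is the same from PySpark): coalesce can only merge existing partitions, while repartition triggers a full shuffle and can produce any number of partitions.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("partition-demo").getOrCreate()
val df = spark.range(0, 1000000).repartition(8)
println(df.rdd.getNumPartitions)                  // 8
println(df.coalesce(100).rdd.getNumPartitions)    // still 8: coalesce never shuffles to add partitions
println(df.repartition(100).rdd.getNumPartitions) // 100: full shuffle
println(df.repartition(1).rdd.getNumPartitions)   // 1: everything shuffled onto a single partition before the write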

Why is the status of task inconsistent between logs and spark web ui?

I did the following operations on an RDD with 4 partitions in the DStream's foreachRDD function of my Spark Streaming application:
print(rdd.count())
print(rdd.collect())
The first statement, rdd.count(), executes normally, while the second statement always stays stuck in the RUNNING status in the Spark web UI.
However, when I take a look at the log, it shows that the task has finished.
18/11/09 16:45:30 INFO executor.Executor: Finished task 3.0 in stage 26.0 (TID 555). 197621638 bytes result sent via BlockManager)
What's the problem?
The Spark version is pyspark==2.2.1, and the cluster is Spark on YARN.

Spark: Why does dropping columns cause a Spark job to fail?

In Spark 2.0, I am running a pyspark job where I read from a table, add some columns whose logic is based on windowing over 30 days' worth of data, and then use df.createOrReplaceTempView followed by spark.sql(create table ... as select * from ...) to create a table in HDFS.
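In Scala, the shape of such a job is roughly the following (a sketch only; the table, column, and window details are made up, and the original code is PySpark):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
// Hypothetical 30-day rolling window keyed on a timestamp column.
val window = Window
  .partitionBy("account_id")
  .orderBy(col("event_ts").cast("long"))
  .rangeBetween(-30L * 24 * 60 * 60, 0)
val df = spark.table("source_db.events")
  .withColumn("rolling_amount", sum("amount").over(window))
df.createOrReplaceTempView("events_enriched")
spark.sql("create table target_db.events_enriched as select * from events_enriched")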
This job runs successfully and creates a table in HDFS. However, I don't need all of the columns I just created in my dataframe. I only need half of the new columns, so I add some logic to drop the columns I don't need (all of the columns that will be dropped were recently created). When I run the drop, df = df.select([c for c in df.columns if c not in ('a','b','d','e')]), the Spark job now fails!
error: Job aborted due to stage failure: Task 139 in stage 1.0 failed 4 times, most recent failure: Lost task 139.3 in stage 1.0 (TID 405, myhost, executor 197): ExecutorLostFailure (executor 197 exited caused by one of the running tasks) Reason: Container marked as failed: container_111 on host: myhost. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
You can use .drop("colname") to drop columns from the dataframe:
df1 = df.drop("a","b","c","d")
Hope it helps you.

Where can one find exceptions thrown by Spark's Executors

We have been getting ExecutorLostExceptions, but have been unable to determine the root cause.
Here is a simplified script that can reproduce the error:
filenames = "hdfs://myfile1,hdfs://myfile2"
sc.textFile(filenames).first()
As an experiment, when I intentionally run a Spark job on 1 GB of data with only 1 MB of spark.executor.memory, the driver prints the following error messages:
16/04/28 17:28:54 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0,
host.addr, partition 0,ANY, 2257 bytes)
16/04/28 17:29:28 INFO MesosSchedulerBackend: Executor lost: 4e199be7-a0bc-407d-ba70-4147e08d6c39-S5, marking slave 4e199be7-a0bc-407d-ba70-4147e08d6c39-S5 as lost
16/04/28 17:29:28 INFO MesosSchedulerBackend: Mesos slave lost: 4e199be7-a0bc-407d-ba70-4147e08d6c39-S5
16/04/28 17:29:28 ERROR TaskSchedulerImpl: Lost executor 4e199be7-a0bc-407d-ba70-4147e08d6c39-S5 on host.addr: Unknown executor exit code (256) (died from signal 128?)
16/04/28 17:29:28 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, host.addr): ExecutorLostFailure (executor 4e199be7-a0bc-407d-ba70-4147e08d6c39-S5 exited caused by one of the running tasks) Reason: Unknown executor exit code (256) (died from signal 128?)
and after a few automated retries, the entire job fails. This happens with both PySpark and Scala Spark.
What are the appropriate logs I can look at to determine exactly why this executor failed?
For this controlled case I know that running out of memory was the cause. However these and other failures with different exit codes occur on a regular basis, and then I don't know where to look or what to fix.
The places I have looked so far include:
The spark UI running on port 4040
/tmp/mesos/slaves/[slaveid]/frameworks/[frameworkid]/executors/[executorid]/runs/latest/{stderr,stdout} on the node whose executor was "lost"
/var/logs/mesos/mesos-slaves.{INFO,WARN,ERROR,FATAL} on the failed node
/tmp/spark-events/[executorid] on the driver node
Those places have helped address some issues, but not e.g. OOM errors, and now I'm not sure where else to look.
You can check under
$HADOOP_HOME/logs/userlogs
There you can find your logs by application id. The application id can be found in the Hadoop cluster web UI:
<your_cluster_master_ip>:8088
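If the driver is still reachable, the application id can also be read programmatically; a minimal Scala sketch, assuming an active SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate()
// Prints the YARN application id (application_<clusterTimestamp>_<sequence>), which names
// the log directory under $HADOOP_HOME/logs/userlogs and the entry in the web UI.
println(spark.sparkContext.applicationId)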
