how to avoid spark-submit cache - apache-spark

A spark-submit job deployed on CDH is doing something weird: it always complains about a query (XXX below) that is not in the current application; it is an OLD query that was used before and has since been deleted. It looks like there is a cache somewhere.
The code is simple: var extract = sqlContext.sql(".....")
How can I fix this?
16/11/13 22:12:29 INFO DAGScheduler: Job 1 finished: aggregate at InferSchema.scala:41, took 3.032230 s
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'XXX' (string and boolean).;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
Thanks.

You may need to remove the old jar and rebuild the application before submitting it again.
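If the suspicion really is a cache on the Spark side rather than a stale jar, the SQLContext API can drop any cached or temporary tables explicitly. A minimal sketch, assuming a Spark 1.x sqlContext; "old_table" is a hypothetical name for a leftover temp table:
// Drop everything from the in-memory table cache of this SQLContext
sqlContext.clearCache()
// If an old temporary table is still registered, remove it explicitly
sqlContext.dropTempTable("old_table")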

Related

Spark saveAsTable with location at s3 bucket's root causes NullPointerException

I am working with Spark 3.0.1, and my partitioned table is stored in S3. Here is a description of the issue.
Create Table
Create table root_table_test_spark_3_0_1 (
id string,
name string
)
USING PARQUET
PARTITIONED BY (id)
LOCATION 's3a://MY_BUCKET_NAME/'
Code that is causing the NullPointerException on the second run
Seq(MinimalObject("id_1", "name_1"), MinimalObject("id_2", "name_2"))
.toDS()
.write
.partitionBy("id")
.mode(SaveMode.Append)
.saveAsTable("root_table_test_spark_3_0_1")
When the Hive metastore is empty everything is OK, but the issue appears when Spark runs getCustomPartitionLocations in the InsertIntoHadoopFsRelationCommand phase (on the second run, for example).
It calls the method below from org.apache.hadoop.fs.Path:
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
return new Path(getParent(), getName()+suffix);
}
But getParent() returns null when the path is the root, resulting in a NullPointerException. The only option I can think of at the moment is to override this method to do something like:
/** Adds a suffix to the final name in the path.*/
public Path suffix(String suffix) {
return (isRoot()) ? new Path(uri.getScheme(), uri.getAuthority(), suffix) : new Path(getParent(), getName()+suffix);
}
Has anyone had issues when the LOCATION of a Spark Hive table is at the root level? Is there a workaround? Is there a known issue open for this?
My runtime does not allow me to override the Path class and fix the suffix method, and I can't move my data out of the bucket's root, where it has lived for two years.
The issue appears because I am migrating from Spark 2.1.0 to Spark 3.0.1; the check of custom partition locations was introduced in Spark 2.2.0 (https://github.com/apache/spark/pull/16460).
All this context helps to understand the problem, but you can reproduce it easily with:
val path: Path = new Path("s3a://MY_BUCKET_NAME/")
println(path.suffix("/id=id"))
FYI, the hadoop-common version is 2.7.4. Here is the full stack trace:
NullPointerException
at org.apache.hadoop.fs.Path.<init>(Path.java:104)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Path.suffix(Path.java:361)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:262)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations(InsertIntoHadoopFsRelationCommand.scala:260)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:107)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:575)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:218)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:166)
Thanks
Looks like a situation where the Spark code calls Path.suffix("something") and, because the root path has no parent, an NPE is triggered.
Long-term fix
1. File a JIRA on issues.apache.org against HADOOP; provide a patch with a test that fixes suffix() to degrade gracefully when called on the root path. Best for everyone.
2. Don't use the root path as the destination of a table.
Do both of these.
Option #2 should also avoid other surprises about how tables are created/committed, etc. Some of the code may fail because an attempt to delete the root of the path (here s3a://some-bucket) won't delete the root, will it?
Put differently: root directories have "odd" semantics everywhere. Most of the time you don't notice this on a local FS because you never try to use / as a destination of work, and you'd be surprised that rm -rf / behaves differently from rm -rf /subdir, and so on. Spark, Hive, etc. were never written to use / as a destination of work, so you get to see the failures.
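A minimal sketch of option 2, assuming a SparkSession named spark and that the table can live under a subdirectory of the bucket instead of the root (the directory name below is an assumption):
// Same table definition as above, but LOCATION points at a subdirectory
// rather than the bucket root; the folder name is a hypothetical choice
spark.sql("""
  CREATE TABLE root_table_test_spark_3_0_1 (
    id string,
    name string
  )
  USING PARQUET
  PARTITIONED BY (id)
  LOCATION 's3a://MY_BUCKET_NAME/root_table_test_spark_3_0_1/'
""")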

org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'

Spark (2.4.5) throws the following error when trying to execute a select query similar to the one shown below.
org.apache.spark.sql.AnalysisException: Undefined function: 'coalesce'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 12
SELECT cast(coalesce(column1,'') as string) as id,cast(coalesce(column2,'2020-01-01') as date) as date1
from 4dea68ed921940e58f027e7146d495a4
Table 4dea68ed921940e58f027e7146d495a4 is a temp view created in spark from dataframe.
This error is happening intermittently only after certain processes. Any help would be much appreciated.
The Spark job is submitted through Livy. The job takes two optional parameters, and only one was provided. Providing all the parameters resolved the issue. I don't know why omitting an optional parameter caused this weird behavior, but passing it fixed the problem.
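Not a root-cause explanation, but a quick sanity check: on a healthy session coalesce resolves as a built-in both in SQL and through the DataFrame functions API, without falling back to the 'default' database. A minimal sketch, assuming an active SparkSession named spark:
import org.apache.spark.sql.functions.{coalesce, lit}
// SQL path: coalesce should resolve as a built-in and print "fallback"
spark.sql("SELECT coalesce(NULL, 'fallback') AS id").show()
// DataFrame path: the same built-in via the functions API
spark.range(1).select(coalesce(lit(null), lit("fallback")).as("id")).show()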

FileNotFoundException: Spark save fails. Cannot clear cache from Dataset[T] avro

I get the following error when saving a dataframe in avro for the second time. If I delete sub_folder/part-00000-XXX-c000.avro after saving and then try to save the same dataset again, I get:
FileNotFoundException: File /.../main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
If I delete not only from sub_folder but also from main_folder, the problem doesn't happen, but I can't afford that.
The problem doesn't actually happen when saving the dataset in any other format.
Saving an empty dataset does not cause the error.
The message suggests that the tables need to be refreshed, but as the output of sparkSession.catalog.listTables().show() shows, there are no tables to refresh:
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
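As an aside, when there is no table name to refresh, Spark also exposes a path-based invalidation on the catalog. A minimal sketch, assuming the avro output directory is held in a variable named path:
// Invalidate cached data and file listings for anything read from this path
sparkSession.catalog.refreshByPath(path)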
The previously saved dataframe looks like this. The application is supposed to update it:
+--------------------+--------------------+
| Col1 | Col2 |
+--------------------+--------------------+
|[123456, , ABC, [...|[[v1CK, RAWNAME1_,..|
|[123456, , ABC, [...|[[BG8M, RAWNAME2_...|
+--------------------+--------------------+
For me this is clearly a cache problem. However, all attempts at clearing the cache have failed:
dataset.write
.format("avro")
.option("path", path)
.mode(SaveMode.Overwrite) // Any save mode gives the same error
.save()
// Moving this either before or after saving doesn't help.
sparkSession.catalog.clearCache()
// This will not un-persist any cached data that is built upon this Dataset.
dataset.cache().unpersist()
dataset.unpersist()
And this is how I read the dataset:
private def doReadFromPath[T <: SpecificRecord with Product with Serializable: TypeTag: ClassTag](path: String): Dataset[T] = {
val df = sparkSession.read
.format("avro")
.load(path)
.select("*")
df.as[T]
}
Finally, this is the stack trace. Thanks a lot for your help!
ERROR [task-result-getter-3] (Logging.scala:70) - Task 0 in stage 9.0 failed 1 times; aborting job
ERROR [main] (Logging.scala:91) - Aborting job 150de02a-ac6a-4d42-824d-5db44a98c19a.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 11, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/DATA/XXX/main_folder/sub_folder/part-00000-3e7064c0-4a82-424c-80ca-98ce75766972-c000.avro does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
... 10 more
Reading from a location and writing to that same location causes this issue. It was also discussed in this forum, along with my answer there.
The message below in the error is misleading; the actual issue is reading from and writing to the same location.
You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL
I am giving an example different from yours (parquet here; avro in your case).
I have 2 options for you.
Option 1 (cache and show will work, as below):
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDS

val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")
df.show(false)
df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back again
val df2 = df1.withColumn("cleanup", lit("Rod want to cleanup")) // like you said you want to clean it up
// The 2 steps below are important: `cache` plus a light action such as show(1);
// without them the FileNotFoundException comes back.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // action
df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("Rod saved in same directory where he read it from; final records he saved after cleanup are:")
df2.show(false)
Option 2:
1) Save the DataFrame to a different avro folder.
2) Delete the old avro folder.
3) Finally, rename the newly created avro folder to the old name; that will work (see the sketch below).
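A minimal sketch of those three steps using the Hadoop FileSystem API. The folder names and the dataset/sparkSession variables are assumptions, and the final rename is only quick on HDFS-like file systems, not on object stores:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Placeholder paths for the existing avro folder and the temporary one
val oldPath = new Path(".../main_folder/sub_folder")
val tmpPath = new Path(".../main_folder/sub_folder_tmp")

// 1) Save to a different (temporary) avro folder
dataset.write.format("avro").mode(SaveMode.Overwrite).option("path", tmpPath.toString).save()

// 2) Delete the old avro folder, then 3) rename the new folder to the old name
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
fs.delete(oldPath, true)
fs.rename(tmpPath, oldPath)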
Thanks a lot, Ram Ghadiyaram!
Solution 2 solved my problem, but only on my local Ubuntu machine; when I tested on HDFS, the problem remained.
Solution 1 was the definitive fix. This is how my code looks now:
private def doWriteToPath[T <: Product with Serializable: TypeTag: ClassTag](dataset: Dataset[T], path: String): Unit = {
// clear any previously cached avro
sparkSession.catalog.clearCache()
// update the cache for this particular dataset, and trigger an action
dataset.cache().show(1)
dataset.write
.format("avro")
.option("path", path)
.mode(SaveMode.Overwrite)
.save()
}
Some remarks:
I had indeed checked that post and tried its solution without success; I had ruled it out as my problem, for the following reasons:
I had created a /temp under 'main_folder', called 'sub_folder_temp', and saving still failed.
Saving the same non-empty dataset in the same path but in json format actually works without the workaround discussed here.
Saving an empty dataset with the same type [T] in the same path actually works without the workaround discussed here.

Java code executed by Databricks/Spark job - settings for one executor

In our current system there is a Java program that reads one file and generates many JSON documents for the full day (24h); all JSON docs are written to CosmosDB. When I execute it from the console everything is OK. I tried to schedule a Databricks job using the uber-jar file, and it failed with the following error:
"Resource with specified id or name already exists."
It kind of makes sense: IMO the default settings of the existing cluster include many executors, so each executor tries to write the same set of JSON docs to CosmosDB.
So I changed the main method as below:
public static void main(String[] args) {
SparkConf conf01 = new SparkConf().set("spark.executor.instances","1").set("spark.executor.cores","1");
SparkSession spark = SparkSession.builder().config(conf01).getOrCreate();
...
}
But I received the same error, "Resource with specified id or name already exists", from CosmosDB.
I would like to have only one executor for this specific Java code. How can I use only one Spark executor?
Any help (link/doc/url/code) will be appreciated.
Thank you !

JavaDStream print() function not printing

I am new to Spark streaming.
I followed the tutorial from this link : https://spark.apache.org/docs/latest/streaming-programming-guide.html
When I ran the code I could see the lines being processed, but I could not see any output with a timestamp.
I only could see this log:
14/10/22 15:24:17 INFO scheduler.ReceiverTracker: Stream 0 received 0 blocks
14/10/22 15:24:17 INFO scheduler.JobScheduler: Added jobs for time 1414005857000 ms
.....
I also tried to save the last DStream with a foreachRDD call, but the data was not being stored.
If anyone can help me with this, it would be a great help.
I met the same problem; here is how I solved it:
change
val conf = new SparkConf().setMaster("local")
to
val conf = new SparkConf().setMaster("local[*]")
Using setMaster("local") is a mistake: with a single local thread the receiver occupies the only thread available, so nothing is left to actually process the data.
Hope this is the problem you are hitting.
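A minimal sketch of the streaming guide's setup with the master fixed, assuming a socket source is being fed on localhost:9999 (for example with nc -lk 9999):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two local threads: one for the socket receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.print() // prints a timestamped header plus the first elements of each batch

ssc.start()
ssc.awaitTermination()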
The print is working as evidenced by the ..... separator, only there's nothing to print: the DStream is empty. The log provided actually shows that: Stream 0 received 0 blocks.
Make sure you're sending data correctly to your Receiver.
val conf = new SparkConf().setMaster("local[*]") works
local[*]: '*' means create the worker thread as the same number as the kernel number of CPU
if using "local", no worker is created, why default is not 1, isn't it a issue?
refer to.
What does setMaster `local[*]` mean in spark?
