AWS Glue - User did not initialize spark context - apache-spark

I am working in AWS Glue. I am reading a Hive-metastore-based table from the AWS Glue Catalog in a Spark Scala job on AWS Glue, using my own custom Spark code (please note that writing our own code is a requirement for us). The code works as expected: it reads the source table and loads the target table, but the job still ends in an error state every time.
Here is my SparkSession:
val spark = SparkSession.builder().appName("SPARK-Dev")
.master("local[*]")
.enableHiveSupport()
.getOrCreate
The job throws this error:
2020-03-27 17:07:53,282 ERROR [main] yarn.ApplicationMaster (Logging.scala:logError(91)) - Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

You are using master("local[*]"). I'm not sure that is correct when running on a cluster. Try it without the master() call.
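A minimal sketch of that change (assuming the Glue job is submitted via spark-submit on YARN, which supplies the master itself):
import org.apache.spark.sql.SparkSession

// Let Glue/YARN supply the master; only configure what the job itself needs.
val spark = SparkSession.builder()
  .appName("SPARK-Dev")
  .enableHiveSupport()
  .getOrCreate()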

Related

Access the Spark Hive metastore within a UDF running on the workers (Databricks)

Context
I have an operation that should be performed on some tables using pyspark.
This operation includes accessing the Spark metastore (in Databricks) to get some metadata.
Since I have plenty of tables, I'm parallelizing this operation across the cluster workers with an RDD, as you can see in the code below:
from pyspark import SparkContext

base_spark_context = SparkContext.getOrCreate()
# distribute the table names and run sync_table on the workers
rdd = base_spark_context.parallelize(tables_list)
rdd.map(lambda table_name: sync_table(table_name)).collect()
The UDF sync_table() runs queries on the metastore, similar to this line of code:
spark_client.session.sql("select 1")
Problem
The problem is that this SQL execution does not succeed; instead, I get a metastore-related error. Traceback:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
(suppressed lines)
Caused by: java.lang.reflect.InvocationTargetException
(suppressed lines)
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@16c0663d, see the next exception for details.
(suppressed lines)
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /databricks/spark/work/app-20210413201900-0000/0/metastore_db.
Is there any limitation on accessing the Databricks metastore from a worker after parallelizing the operation this way? Or is there a way to perform such an operation?
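As far as I know, the SparkSession and its metastore client exist only on the driver, so spark.sql cannot be called inside rdd.map on the executors (each worker then tries to spin up its own local Derby metastore, which is what the traceback shows). A common workaround is to keep the metastore queries on the driver and parallelize them with ordinary futures or a thread pool instead of an RDD. A rough Scala sketch of that idea (syncTable, the DESCRIBE query, and tableList are placeholders; the same pattern works in PySpark with concurrent.futures):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val tableList = Seq("db.table_a", "db.table_b")   // placeholder table names

// Runs on the driver, where the shared metastore client is available.
def syncTable(tableName: String): Unit =
  spark.sql(s"DESCRIBE TABLE $tableName").collect()

// Parallelize driver-side instead of shipping spark.sql to the executors.
val work = Future.traverse(tableList)(t => Future(syncTable(t)))
Await.result(work, Duration.Inf)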

Exception java.util.NoSuchElementException: None.get in Spark Dataset save() operation

I got the exception "java.util.NoSuchElementException: None.get" when I tried to save a Dataset to S3 storage as Parquet.
The exception:
java.lang.IllegalStateException: Failed to execute CommandLineRunner
at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:787)
at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:768)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:322)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1226)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1215)
...
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker$.metrics(BasicWriteStatsTracker.scala:173)
at org.apache.spark.sql.execution.command.DataWritingCommand$class.metrics(DataWritingCommand.scala:51)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.metrics$lzycompute(InsertIntoHadoopFsRelationCommand.scala:47)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.metrics(InsertIntoHadoopFsRelationCommand.scala:47)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.metrics$lzycompute(commands.scala:100)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.metrics(commands.scala:100)
at org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:56)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:76)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
It looks like an issue related to the SparkContext.
I didn't create an instance of SparkContext explicitly; instead, I only use a SparkSession in my source code.
final SparkSession sparkSession = SparkSession
.builder()
.appName("Java Spark SQL job")
.getOrCreate();
ds.write().mode("overwrite").parquet(path);
Any suggestions or workarounds? Thanks.
Update 1:
The creation of ds is a little complicated, but I will try to list the main call flow below:
Process 1:
session.read().parquet(path) as source;
ds.createOrReplaceTempView(view);
sparkSession.sql(sql) as ds1;
sparkSession.sql(sql) as ds2;
ds1.save()
ds2.save()
Process 2:
After step 6, I loop back to step 1 with the same Spark session for the next process.
Finally, sparkSession.stop() is called after everything has been processed.
I can see the following log entry after process 1 completes, which seems to indicate the SparkContext has been destroyed before process 2:
INFO SparkContext: Successfully stopped SparkContext
Simply removing sparkSession.stop() solved this issue.
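For clarity, the lifecycle that avoids the error looks roughly like this (a Scala sketch only; the paths and queries are placeholders, not the poster's actual code): build the session once, reuse it for every process, and never stop it while later processes still need it.

import org.apache.spark.sql.SparkSession

object ReuseSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Java Spark SQL job").getOrCreate()

    val inputPaths = Seq("s3a://bucket/in1", "s3a://bucket/in2")  // placeholder inputs
    val outputBase = "s3a://bucket/out"                           // placeholder output

    for ((path, i) <- inputPaths.zipWithIndex) {   // one iteration per "process"
      val source = spark.read.parquet(path)
      source.createOrReplaceTempView("view")
      val ds1 = spark.sql("SELECT * FROM view")             // placeholder queries
      val ds2 = spark.sql("SELECT count(*) AS n FROM view")
      ds1.write.mode("overwrite").parquet(s"$outputBase/$i/ds1")
      ds2.write.mode("overwrite").parquet(s"$outputBase/$i/ds2")
    }

    // Stopping the context between processes is what triggers
    // NoSuchElementException: None.get on the next write; if you stop at all,
    // do it only here, after every write has completed.
    spark.stop()
  }
}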

Databricks -> Snowflake: SQL compilation error: Stage: 'XYZ' cannot be a temporary stage in the pipe definition

I am trying to materialize a stream from Databricks into a Snowflake table:
parsedStream
.writeStream
.outputMode("append")
.options(options)
.option("dbtable", "test_table")
.option("streaming_stage","test_stage")
.option("checkpointLocation","/demo-checkpoints")
.format("snowflake")
.start()
options contains all the details necessary to authenticate to Snowflake, and this part works.
I tried both a pre-created stage and a non-existing stage (so that Databricks would create a temporary stage, which is how it is supposed to work).
I get the following error:
net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Stage: 'test_stage' cannot be a temporary stage in the pipe definition.
at net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowExceptionSub(SnowflakeUtil.java:135)
at net.snowflake.client.jdbc.SnowflakeUtil.checkErrorAndThrowException(SnowflakeUtil.java:60)
at net.snowflake.client.core.StmtUtil.pollForOutput(StmtUtil.java:503)
at net.snowflake.client.core.StmtUtil.execute(StmtUtil.java:370)
at net.snowflake.client.core.SFStatement.executeHelper(SFStatement.java:474)
at net.snowflake.client.core.SFStatement.executeQueryInternal(SFStatement.java:230)
at net.snowflake.client.core.SFStatement.executeQuery(SFStatement.java:172)
at net.snowflake.client.core.SFStatement.execute(SFStatement.java:663)
at net.snowflake.client.jdbc.SnowflakeStatementV1.executeQueryInternal(SnowflakeStatementV1.java:161)
at net.snowflake.client.jdbc.SnowflakePreparedStatementV1.executeQuery(SnowflakePreparedStatementV1.java:153)
at net.snowflake.spark.snowflake.JDBCWrapper$$anonfun$executePreparedQueryInterruptibly$1.apply(SnowflakeJDBCWrapper.scala:257)
at net.snowflake.spark.snowflake.JDBCWrapper$$anonfun$executePreparedQueryInterruptibly$1.apply(SnowflakeJDBCWrapper.scala:255)
at net.snowflake.spark.snowflake.JDBCWrapper$$anonfun$2.apply(SnowflakeJDBCWrapper.scala:292)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
res17: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1f36b6a9
Any ideas?
Spark Streaming (using the function writeStream) with Snowflake is currently not generally available as a feature. You might want to rewrite your application to use the currently available Spark connector (which does not support Structured Streaming).
https://docs.snowflake.net/manuals/user-guide/spark-connector.html
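For reference, a batch-style write with that connector looks roughly like this (a sketch; the sfOptions values are placeholders for your real connection properties):

import net.snowflake.spark.snowflake.Utils.SNOWFLAKE_SOURCE_NAME
import org.apache.spark.sql.{DataFrame, SaveMode}

// Placeholder connection properties; fill in your own account details.
val sfOptions = Map(
  "sfURL"       -> "<account>.snowflakecomputing.com",
  "sfUser"      -> "<user>",
  "sfPassword"  -> "<password>",
  "sfDatabase"  -> "<database>",
  "sfSchema"    -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// Write a (non-streaming) DataFrame into Snowflake with the batch connector.
def writeBatch(df: DataFrame): Unit =
  df.write
    .format(SNOWFLAKE_SOURCE_NAME)   // resolves to "net.snowflake.spark.snowflake"
    .options(sfOptions)
    .option("dbtable", "test_table")
    .mode(SaveMode.Append)
    .save()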

spark structured streaming unable to start from checkpoint location

I'm writing a simple Spark program using the Structured Streaming feature and Kafka. Kafka is the source, and there are the following sinks:
Sink 1 - Console sink -- works fine in all cases
Sinks 2 & 3 - H2 and Ignite foreach sinks
The first run works fine, but when I kill and restart the program with the checkpoint location I get the error below:
17/07/12 07:11:48 ERROR StreamExecution: Query h2Out [id = 22ce7168-6f12-4220-8f28-f9eaaaba9c6a, runId = 39ecb40a-5b54-4b36-a0da-6e3057d66b2e] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.kafka010.KafkaSource$$anon$1.parseVersion(Ljava/lang/String;I)I
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:116)
at org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:99)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:129)
at org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:97)
at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:222)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:452)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:448)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2.apply(StreamExecution.scala:448)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:447)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:244)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:239)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
I checked the KafkaSource source code; the parseVersion method should be available through org.apache.spark.sql.execution.streaming.HDFSMetadataLog, for which the jar (spark-sql_2.11-2.1.1.jar) is on the classpath.
For info, I'm using the Kafka 0.10.2.1 Maven dependencies.
This error means your Spark version is older than 2.1.1. HDFSMetadataLog.parseVersion was added in Spark 2.1.1, and spark-sql-kafka-0-10_2.11-2.1.1.jar calls it. If your Spark version is older than 2.1.1, you will see this NoSuchMethodError.
You can check your Spark version by calling SparkSession.version (e.g., just type spark.version in the Spark shell).
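If the versions are out of sync, the fix is to align the Kafka source artifact with the Spark runtime; a build.sbt sketch (the versions below are assumptions, match them to your actual cluster):

// Keep the Kafka source artifact in lockstep with the Spark version that
// actually runs the job (2.1.1 in this example).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "2.1.1" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.1",
  "org.apache.kafka" %  "kafka-clients"        % "0.10.2.1"
)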

Hive on Spark CDH5.7 Execution Error

I recently updated my cluster to CDH 5.7 and I am trying to run a Hive query on the Spark execution engine.
I have configured the Hive client to use the Spark execution engine and set the Hive dependency on a Spark service in Cloudera Manager.
Via Hue, I'm running a simple SELECT query but seem to get this error every time: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
Here are the corresponding logs:
ERROR operation.Operation: Error running hive query:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:374)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:180)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:72)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:232)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:245)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Any help to solve this would be great!
This problem is due to an open JIRA: https://issues.apache.org/jira/browse/HIVE-11519. You should use another serialization tool.
Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
is not the real error message; you should turn on DEBUG logging using the Hive CLI, like this:
bin/hive --hiveconf hive.root.logger=DEBUG,console
and you will get more detailed logs. For example, here is something I got before:
16/03/17 13:55:43 [fxxxxxxxxxxxxxxxx4 main]: INFO exec.SerializationUtilities: Serializing MapWork using kryo
java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
This is caused by dependency conflicts; see https://issues.apache.org/jira/browse/HIVE-13301 for details.
