Batch Read inside flatMapGroupsWithState - apache-spark

I've been looking for a way to update a streaming application's state based on the data already present in that state. The application determines, via the processing done inside the flatMapGroupsWithState function, whether data is missing, and should then perform a read to fetch that missing data.
Is it possible to perform such an operation?
flatMapGroupsWithState is executed on the executors, so the SparkSession isn't available there. Calling SparkSession.getActiveSession, or trying to create a new SparkSession, results in the following error:
java.lang.IllegalStateException: No active or default Spark session found
at org.apache.spark.sql.SparkSession$.$anonfun$active$2(SparkSession.scala:1055)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$.$anonfun$active$1(SparkSession.scala:1055)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$.active(SparkSession.scala:1054)
at org.apache.spark.sql.delta.sources.DeltaDataSource.getTable(DeltaDataSource.scala:68)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:274)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
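
For context, a minimal sketch of the pattern being attempted; the case classes, path, and group logic are illustrative, not the original job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.GroupState

case class Event(id: String, value: Long)
case class EventState(seen: Seq[Long])

// Runs on the executors for each key. Trying to read the missing history here,
// e.g. from a Delta table, is what triggers the IllegalStateException above.
def updateState(id: String, events: Iterator[Event], state: GroupState[EventState]): Iterator[Event] = {
  val current = state.getOption.getOrElse(EventState(Seq.empty))
  if (current.seen.isEmpty) {
    val spark = SparkSession.active                                    // no active session on the executor
    val history = spark.read.format("delta").load("/path/to/history") // never reached
  }
  state.update(EventState(current.seen ++ events.map(_.value).toSeq))
  Iterator.empty
}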

Related

Access Spark Hive metastore within a UDF running on the workers (Databricks)

Context
I have an operation that should be performed on some tables using pyspark.
This operation includes accessing the Spark metastore (in Databricks) to get some metadata.
Since I have plenty of tables, I'm parallelizing this operation across the cluster workers with an RDD, as you can see in the code below:
from pyspark import SparkContext

base_spark_context = SparkContext.getOrCreate()
rdd = base_spark_context.parallelize(tables_list)
rdd.map(lambda table_name: sync_table(table_name)).collect()
The UDF sync_table() runs queries against the metastore, similar to this line:
spark_client.session.sql("select 1")
Problem
The problem is that this SQL execution does not succeed; instead I get a metastore-related error. Traceback:
py4j.protocol.Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
(suppressed lines)
Caused by: java.lang.reflect.InvocationTargetException
(suppressed lines)
Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader#16c0663d, see the next exception for details.
(suppressed lines)
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /databricks/spark/work/app-20210413201900-0000/0/metastore_db.
Is there any limitation on accessing the Databricks metastore from within a worker after parallelizing the operation in this way? Or is there another way to perform such an operation?

Spark SQL - org.apache.spark.sql.AnalysisException

The error described below occurs when I run a Spark job on Databricks a second time (less often on the first run).
The SQL query simply performs a CREATE TABLE AS SELECT from a temp view registered from a DataFrame.
My first idea was to call spark.catalog.clearCache() at the end of the job (didn't help).
I also found a post on the Databricks forum about using object ... extends App (Scala) instead of a main method (didn't help either).
P.S. current_date() is a built-in function and I expected it to be provided automatically.
Spark 2.4.4, Scala 2.11, Databricks Runtime 6.2
org.apache.spark.sql.AnalysisException: Undefined function: 'current_date'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 21 pos 4
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$50.apply(Analyzer.scala:1318)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1317)
at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15.applyOrElse(Analyzer.scala:1309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:279)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:76)
Solution: ensure Spark is initialized every time the job is executed.
TL;DR: I had a similar issue, and the object ... extends App suggestion pointed me in the right direction. In my case I was creating the Spark session outside of main, but inside the object. When the job was executed the first time, the cluster/driver loaded the jar and initialised the spark variable; once the job finished successfully, the jar was kept in memory but the reference to spark was lost for some reason, and subsequent executions did not reinitialize it, because the jar was already loaded and my Spark initialisation, being outside main, was never re-run. I think this is not an issue for Databricks jobs that create a cluster or start one before execution (those behave like a first-time start); it only affects clusters that are already up and running, since jars are loaded either at cluster start-up or at job execution.
So I moved the Spark session creation, i.e. SparkSession.builder()...getOrCreate(), into main, so that the Spark session is reinitialized every time the job is called.
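
A minimal sketch of the fix described above (object and query names are illustrative):

import org.apache.spark.sql.SparkSession

object MyJob {
  // Previously the session lived here as a field, initialised only once per jar load.
  def main(args: Array[String]): Unit = {
    // Creating (or re-acquiring) the session inside main means every job run
    // gets a properly initialised session, even on an already-running cluster.
    val spark = SparkSession.builder().getOrCreate()
    spark.sql("select current_date()").show()
  }
}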
current_date() is the built-in function and it should be provided automatically (expected)

This expectation is wrong; you have to import the functions.
For Scala:
import org.apache.spark.sql.functions._
where the current_date function is available.
For PySpark:
from pyspark.sql import functions as F
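
For instance, with the Scala import in place, current_date can be used directly through the DataFrame API (a minimal illustration, assuming an existing spark session):

import org.apache.spark.sql.functions._

val withDate = spark.range(1).withColumn("today", current_date())
withDate.show()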

Spark not able to find checkpointed data in HDFS after executor fails

I am streaming data from Kafka as below:
final JavaPairDStream<String, Row> transformedMessages = rtStream
        .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
        .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32))
        .stateSnapshots();

transformedMessages.foreachRDD(rdd -> {
    // logic goes here
});
I have four worker threads and multiple executors for this application, and I am trying to verify Spark's fault tolerance.
Since we are using mapWithState, Spark checkpoints the state data to HDFS, so if any executor/worker goes down we should be able to recover the lost data (the data lost on the dead executor) and continue with the remaining executors/workers.
So I kill one of the worker nodes to see if the application still runs smoothly, but instead I get a FileNotFoundException from HDFS, as shown below.
This is a bit odd: Spark checkpointed the data to HDFS at some point, so why can it not find it? Obviously HDFS is not deleting any data, so why this exception?
Or am I missing something here?
[ERROR] 2018-08-21 13:07:24,067 org.apache.spark.streaming.scheduler.JobScheduler logError - Error running job streaming job 1534871220000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000
java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:273)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:273)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:273)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1615)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1626)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$1.apply(DAGScheduler.scala:1625)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1625)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2.apply(DAGScheduler.scala:1623)
Further Update:
I found that the RDD Spark is trying to find in HDFS had already been deleted by the "ReliableRDDCheckpointData" process, which had created a new RDD for the checkpoint data.
The DAG is still pointing to this old RDD somehow. Had there been any live reference to this data, it should not have been deleted.
Consider this pipeline of transformations on a Spark stream:
rtStream
    .mapToPair(record -> new Tuple2<String, GenericDataModel>(record.key(), record.value()))
    .mapWithState(StateSpec.function(updateDataFunc).numPartitions(32))
    .stateSnapshots()
    .foreachRDD(rdd -> {
        if (counter == 1) {
            // convert the RDD to a Dataset and register it as a SQL table named "InitialDataTable"
        } else {
            // convert the RDD to a Dataset and register it as a SQL table named "ActualDataTable"
        }
    });
mapWithState automatically checkpoints the state data after every batch, so each "rdd" in the foreachRDD block above is checkpointed, and each checkpoint overwrites the previous one (because, obviously, only the latest state needs to stay in the checkpoint).
But if the user is still using rdd number 1 (as in my case, where I register the very first rdd as one table and every subsequent rdd as a different table), it should not be overwritten. It is the same as in Java: if something holds a reference to an object, that object is not eligible for garbage collection.
Now, when I try to access the table "InitialDataTable", the "rdd" used to create it is obviously no longer in memory, so Spark goes to HDFS to recover it from the checkpoint. It does not find it there either, because it was overwritten by the very next rdd, and the Spark application stops, citing:
"org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: java.io.FileNotFoundException: File does not exist: hdfs://mycluster/user/user1/sparkCheckpointData/2db59817-d954-41a7-9b9d-4ec874bc86de/rdd-1005/part-00000"
So to resolve this issue, I had to checkpoint the very first rdd explicitly.
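
A rough sketch of that explicit checkpoint (written in Scala rather than the original Java, with the table registration elided; variable names follow the question):

var counter = 0

transformedMessages.foreachRDD { rdd =>
  counter += 1
  if (counter == 1) {
    // Checkpoint the very first RDD explicitly so its files survive
    // the automatic overwrite done by the mapWithState checkpointing.
    rdd.checkpoint()
    rdd.count() // force evaluation so the checkpoint is actually written
    // convert the RDD to a Dataset and register it as "InitialDataTable"
  } else {
    // convert the RDD to a Dataset and register it as "ActualDataTable"
  }
}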

Spark Structured Streaming - Insert into existing Hive table with scalability without errors

This seems like a fairly straightforward thing to do, and you would think the Spark developers would have built in a way to do it, but I could not find one. Options I have seen or tried are:
Write to a Parquet file and create an external Hive table, but I want to insert into an existing Hive internal (managed) table. I know I could add this data with another Spark job that runs periodically, but that is not ideal.
Write a custom ForeachWriter, which I have tried with limited success:
val ds1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostPort)
  .option("subscribe", "test")
  .load()
val lines = ds1.selectExpr("CAST(value AS STRING)").as[String]
val words = lines.flatMap(_.split(" "))

import org.apache.spark.sql.ForeachWriter

val writer = new ForeachWriter[String] {
  import org.apache.spark.sql.SparkSession

  override def open(partitionId: Long, version: Long) = true

  override def process(value: String) = {
    val sparksess = SparkSession.builder.master("local[*]").enableHiveSupport().getOrCreate()
    import sparksess.implicits._
    val valData = List(value).toDF
    valData.write.mode("append").insertInto("stream_test")
    sparksess.close
  }

  override def close(errorOrNull: Throwable) = {}
}

val q = words.writeStream.queryName("words-app").foreach(writer)
val query = q.start()
NOTE: it would seem obvious to create the SparkSession in the open method and close it in the close method, but this did not work for more than one iteration before failing.
Note: the above worked even past the first iteration, but Spark errors are being thrown that do not make me comfortable that this solution will scale:
18/03/15 11:10:52 ERROR cluster.YarnScheduler: Lost executor 2 on hadoop08.il.nds.com: Container marked as failed: container_1519892135449_0122_01_000003 on host: hadoop08.il.nds.com. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1519892135449_0122_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
at org.apache.hadoop.util.Shell.run(Shell.java:504)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Container exited with a non-zero exit code 50
Note: a similar question was asked here:
But the answer does not say how to do this with a streaming context, and that part was left unanswered.
Can anyone help fix this code so that it scales better without errors?
Write back to a Kafka sink and then have Kafka write to Hive using Kafka Connect, as described here:
However, I want to avoid this extra complexity.
Write to a JDBC sink - this will be my next attempt, but it seems like it should not be necessary for Hive within Spark!
I have read that Databricks is developing something similar - see here:
But it is not clear when, or whether, this will be released as part of a standard Spark GA release.
Any help, especially from Spark developers, regarding the best way to stream-write data into existing Hive tables would be very much appreciated.
Edit:
I wrote a JDBC sink extending ForeachWriter and it works, but it is very slow since it has to open and close a connection for each row! I wanted to create a connection pool that would reuse and release connections, but I read that this is not possible with ForeachWriter and that I would have to write my own custom JDBC sink to do it, which is complicated, as described here.
I am planning on one of two options:
1. For the sake of simplicity, and until Spark releases a built-in solution, write the data out to Parquet files accessible through a Hive external table, and write a job to periodically move the data into the managed table.
2. Use the older Spark Streaming DStream API, which seems to allow foreachRDD to lazily create singleton sessions that can be reused (see the sketch below), but then I assume I would lose the "exactly once" guarantees of Structured Streaming and would have to completely rewrite my code once this capability is built into Structured Streaming.
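
A rough sketch of the lazy-singleton session mentioned in option 2, untested; the DStream name is an assumption, and "stream_test" is the table from the code above:

import org.apache.spark.sql.SparkSession

object SparkSessionSingleton {
  @transient private var instance: SparkSession = _

  // Create the Hive-enabled session once per JVM and reuse it afterwards.
  def getInstance(): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.enableHiveSupport().getOrCreate()
    }
    instance
  }
}

wordsDStream.foreachRDD { rdd =>
  val spark = SparkSessionSingleton.getInstance()
  import spark.implicits._
  // One write per micro-batch, instead of one session per row as in the ForeachWriter.
  rdd.toDF("value").write.mode("append").insertInto("stream_test")
}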

An error occurred when writing a pickled representation of spark rdd to file

I use the following code to persist a spark rdd.
rdd = sc.parallelize([1,2,3])
file = open('test','w')
import pickle
pickle.dump(rdd, file)
and the error message is:
Py4JError: An error occurred while calling o550.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:722)
I think an RDD is similar to a handle, and I shouldn't have to save all the data to HDFS to use it next time.
So, can PySpark RDD objects be persisted? If not, why not? How can I save an RDD object for access in another runtime in an elegant way?
Have you considered just saving it as a pickle file using the saveAsPickleFile method available on the RDD?
rdd = sc.parallelize([1,2,3])
rdd.saveAsPickleFile('user/cloudera/parallalized_collection')
From the documentation
saveAsPickleFile(path, batchSize=10)
Save this RDD as a SequenceFile of serialized objects. The serializer used is pyspark.serializers.PickleSerializer, default batch size is 10.
An RDD is a proxy for a Java object. To serialize it correctly you'd have to serialize both the Java and the Python objects. Unfortunately this won't help you at all: while the JVM RDD is Serializable, it is so only for internal purposes:
Spark does not support performing actions and transformations on copies of RDDs that are created via deserialization. RDDs are serializable so that certain methods on them can be invoked in executors, but end users shouldn't try to manually perform RDD serialization.
To address your question:
how can I save an RDD object for access in another runtime in an elegant way?
If you're interested in preserving the data, use one of the output methods (RDD.saveAs*).
Otherwise just recreate the RDD from the beginning - the cost is negligible, as an RDD is only a recipe.

Resources