I have a Spark streaming job that reads Cosmos DB change feed data as shown below, running on a Databricks cluster with DBR 8.2.
cosmos_config = {
    "spark.cosmos.accountEndpoint": cosmos_endpoint,
    "spark.cosmos.accountKey": cosmos_key,
    "spark.cosmos.database": cosmos_database,
    "spark.cosmos.container": collection,
    "spark.cosmos.read.partitioning.strategy": "Default",
    "spark.cosmos.read.inferSchema.enabled": "false",
    "spark.cosmos.changeFeed.startFrom": "Now",
    "spark.cosmos.changeFeed.mode": "Incremental"
}
df_read = (spark.readStream
    .format("cosmos.oltp.changeFeed")
    .options(**cosmos_config)
    .schema(cosmos_schema)
    .load())
from pyspark.sql.functions import current_date

df_write = (df_read.withColumn("partition_date", current_date())
    .writeStream
    .partitionBy("partition_date")
    .format("delta")
    .option("path", master_path)
    .option("checkpointLocation", f"{master_path}_checkpointLocation")
    .queryName("cosmosStream")
    .trigger(processingTime="10 seconds")
    .start()
)
The job ordinarily works well, but occasionally the stream stops all of a sudden and the messages below appear in a loop in the log4j output. Restarting the job processes all the data in the backlog. Has anyone experienced something like this before? I'm not sure what could be causing it. Any ideas?
22/02/27 00:57:58 INFO HiveMetaStore: 1: get_database: default
22/02/27 00:57:58 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
22/02/27 00:57:58 INFO DriverCorral: Metastore health check ok
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Starting...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Start completed.
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown initiated...
22/02/27 00:58:07 INFO HikariDataSource: metastore-monitor - Shutdown completed.
22/02/27 00:58:07 INFO MetastoreMonitor: Metastore healthcheck successful (connection duration = 88 milliseconds)
22/02/27 00:58:50 INFO RxDocumentClientImpl: Getting database account endpoint from https://<cosmosdb_endpoint>.documents.azure.com:443
Which version of the Cosmos Spark connector are you using? Several bug fixes were made to the bulk ingestion code path between 4.3.0 and 4.6.2.
See https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-1_2-12/CHANGELOG.md for more details.
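If you are on an older build, upgrading to the most recent connector as a cluster library would be worth trying; for DBR 8.2 (Spark 3.1, Scala 2.12) the Maven coordinate looks like this, with 4.6.2 being the latest version mentioned above:
com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.6.2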
Related
I'm trying to use Databricks on Azure with a Spark Structured Streaming job and am running into a very mysterious issue.
I boiled the job down to its basics for testing: reading from a Kafka topic and writing to the console in a foreachBatch.
Locally, everything works fine indefinitely.
On Databricks, the task terminates after just over 5 minutes with a "Cancelled" status.
There are no errors in the log, just the following, which appears to be a graceful shutdown request of some kind, but I don't know where it's coming from:
22/11/04 18:31:30 INFO DriverCorral$: Cleaning the wrapper ReplId-1ea30-8e4c0-48422-a (currently in status Running(ReplId-1ea30-8e4c0-48422-a,ExecutionId(job-774316032912321-run-84401-action-5645198327600153),RunnableCommandId(9102993760433650959)))
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverLocal: cancelled jobGroup:2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2207618020913201706,None,Some(job-774316032912321-run-84401-action-5645198327600153)).
22/11/04 18:31:30 INFO DatabricksStreamingQueryListener: Stopping the stream [id=d41eff2a-4de6-4f17-8d1c-659d1c1b8d98, runId=5bae9fb4-b5e1-45a0-af1e-a2f2553592c9]
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 5bae9fb4-b5e1-45a0-af1e-a2f2553592c9
22/11/04 18:31:30 INFO TaskSchedulerImpl: Cancelling stage 366
22/11/04 18:31:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 366: Stage cancelled
22/11/04 18:31:30 INFO MicroBatchExecution: QueryExecutionThread.interruptAndAwaitExecutionThreadTermination called with streaming query exit timeout=15000 ms
For reference, here is the code:
val incomingStream = spark.readStream
  .format("kafka")
  .option("subscribe", ehName)
  .option("kafka.bootstrap.servers", topicUriWithPort)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.jaas.config", jaas)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 1) // TODO: make configurable
  .load()
val processedWriteStream = incomingStream
  .writeStream
  .queryName("query2")
  .foreachBatch((d: DataFrame, b: Long) => {
    d.show()
  })
  .start()

processedWriteStream.awaitTermination()
Structured Streaming provides fault tolerance and data consistency for streaming queries; using Databricks workflows, you can easily configure your Structured Streaming queries to restart on failure automatically.
By enabling checkpointing for a streaming query, you can restart the query after a failure; the restarted query continues where the failed one left off.
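For the Kafka example above, that means giving the writer a checkpoint location. A minimal sketch, where "/tmp/checkpoints/query2" is a placeholder path you would replace with a durable location:
val processedWriteStream = incomingStream
  .writeStream
  .queryName("query2")
  .option("checkpointLocation", "/tmp/checkpoints/query2") // placeholder; use a durable path
  .foreachBatch((d: DataFrame, b: Long) => {
    d.show()
  })
  .start()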
I am trying to use databricks-connect to run a Spark job on a Databricks cluster from IntelliJ. I followed the documentation below:
https://docs.databricks.com/dev-tools/databricks-connect.html
However, I could not make it work with IntelliJ; it throws the exception below:
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/10/01 18:32:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
Exception in thread "main" java.lang.NoSuchFieldError: JAVA_9
at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:207)
at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:95)
at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:443)
at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:384)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:432)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:262)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:291)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:495)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2834)
I could not find a workaround for this, as the documentation does not say anything clearly. I cross-checked in IntelliJ that it points to the correct jar directory returned by databricks-connect get-jar-dir. Any clue on this would be helpful.
Note: databricks-connect test returns success.
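For what it's worth, a NoSuchFieldError: JAVA_9 thrown from StorageUtils typically points at an old commons-lang3 jar (one predating the JavaVersion.JAVA_9 constant, added around 3.5) shadowing the version Spark expects. Under that assumption, a small Scala diagnostic run from the same project shows which jar the class was actually loaded from:
// Prints the jar that provided org.apache.commons.lang3.JavaVersion;
// an old commons-lang3 here would suggest a dependency conflict.
println(classOf[org.apache.commons.lang3.JavaVersion]
  .getProtectionDomain.getCodeSource.getLocation)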
I have a single-node Spark cluster (4 CPU cores and 15 GB of memory) configured with a single worker. I can access the web UI and see the worker node. However, I am having trouble submitting jobs using spark-submit. I have a couple of questions.
I have an uber-jar file stored on the cluster. I used the following command to submit a job: spark-submit --class Main --deploy-mode cluster --master spark://cluster:7077 uber-jar.jar. This starts the job, but it fails immediately with the following log messages.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/13 01:19:47 INFO SecurityManager: Changing view acls to: admin
19/11/13 01:19:47 INFO SecurityManager: Changing modify acls to: admin
19/11/13 01:19:47 INFO SecurityManager: Changing view acls groups to:
19/11/13 01:19:47 INFO SecurityManager: Changing modify acls groups to:
19/11/13 01:19:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(admin); groups with view permissions: Set(); users with modify permissions: Set(admin); groups with modify permissions: Set()
19/11/13 01:19:48 INFO Utils: Successfully started service 'driverClient' on port 46649.
19/11/13 01:19:48 INFO TransportClientFactory: Successfully created connection to cluster/10.10.10.10:7077 after 37 ms (0 ms spent in bootstraps)
19/11/13 01:19:48 INFO ClientEndpoint: Driver successfully submitted as driver-20191113011948-0010
19/11/13 01:19:48 INFO ClientEndpoint: ... waiting before polling master for driver state
19/11/13 01:19:53 INFO ClientEndpoint: ... polling master for driver state
19/11/13 01:19:53 INFO ClientEndpoint: State of driver-20191113011948-0010 is FAILED
19/11/13 01:19:53 INFO ShutdownHookManager: Shutdown hook called
19/11/13 01:19:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-4da02cd2-5cfc-4a2a-ad10-41a594569ea1
What am I doing wrong, and how do I correctly submit the job?
If my uber-jar file is on my local computer, how do I correctly use spark-submit to submit a Spark job to the cluster using that jar? I've experimented with running spark-shell on my local computer, pointing at the standalone cluster with spark-shell --master spark://cluster:7077. This starts a Spark shell locally, and I can see (in the Spark web UI) that the worker gets memory assigned to it in the cluster. However, if I try to perform a task in the shell, I get the following error message.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
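For reference, a client-mode submission from a local machine would look something like the lines below; the spark.driver.host value and resource sizes are placeholders, and the workers must be able to reach the driver's host (a frequent cause of the "has not accepted any resources" warning):
spark-submit --class Main --deploy-mode client --master spark://cluster:7077 \
  --conf spark.driver.host=<local_ip> \
  --executor-memory 2g --total-executor-cores 2 \
  uber-jar.jar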
I am not able to create a view on a Hive table using HiveContext; I'm facing a DbLockManager lock issue. The same view creation query works fine in Hive Beeline, but it fails when executed using HiveContext.
17/02/23 10:44:18 INFO metastore: Trying to connect to metastore with URI
thrift://XXXXXXXXXXXXXXXXXXXXXXXXXXX
17/02/23 10:44:18 INFO metastore: Connected to metastore.
17/02/23 10:44:18 INFO DbLockManager: Response to queryId=XXXXXXXX_20170223104411_2b1a475e-ad6d-45b3-8ec6-6a30a9123664 LockResponse(lockid:419, state:WAITING)
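When a query sits at LockResponse(state:WAITING), it helps to see what is holding the lock. SHOW LOCKS is standard HiveQL and can be run from Beeline; assuming your HiveContext passes native Hive commands through (an assumption about your Spark version), the same check can be done in code:
// Lists current Hive locks so you can see which transaction is blocking the view creation
hiveContext.sql("SHOW LOCKS").show()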
I am very new to Spark and Cassandra. I am trying a simple Java program where I add new rows to a Cassandra table using the spark-cassandra-connector provided by DataStax.
I am running DSE on my laptop and, using Java, trying to save the data to the Cassandra DB through Spark. Following is the code:
// Assumes sc is an existing JavaSparkContext, a User bean class, and a test.users table
Map<String, String> extra = new HashMap<String, String>();
extra.put("city", "bangalore");
extra.put("dept", "software");

// Build a one-row RDD of User beans and write it to Cassandra via the connector's Java API
List<User> products = Arrays.asList(new User(1, "vamsi", extra));
JavaRDD<User> productsRDD = sc.parallelize(products);
javaFunctions(productsRDD, User.class).saveToCassandra("test", "users");
When I execute this code, I get the following error:
16/03/26 20:57:31 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:44 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:57:51 INFO client.AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
16/03/26 20:57:59 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
16/03/26 20:58:11 ERROR client.AppClient$ClientActor: All masters are unresponsive! Giving up.
16/03/26 20:58:11 ERROR cluster.SparkDeploySchedulerBackend: Spark cluster looks dead, giving up.
16/03/26 20:58:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/26 20:58:11 INFO scheduler.DAGScheduler: Failed to run runJob at RDDFunctions.scala:48
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Spark cluster looks down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Looks like you need to fix your Spark configuration; the "Initial job has not accepted any resources" warning and the "All masters are unresponsive" errors usually point at a wrong master URL or at workers without enough resources to satisfy the request. See this:
http://www.datastax.com/dev/blog/common-spark-troubleshooting
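As a concrete starting point, the settings that guide walks through look roughly like this; a minimal Scala sketch where the master URL, Cassandra host, and memory size are placeholders to match against your DSE setup:
import org.apache.spark.{SparkConf, SparkContext}

// The master URL must match what the Spark master UI reports, and the executor
// memory must fit within what the worker advertises; all values are placeholders.
val conf = new SparkConf()
  .setAppName("save-to-cassandra")
  .setMaster("spark://127.0.0.1:7077")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)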