Kafka producer authentication within a databricks notebook - apache-spark

I am looking for a way to authenticate my Databricks notebook so it can publish messages to a Kafka topic that requires an IMS token for auth, while using the Spark Kafka library to publish.
Does anyone have any idea how I can achieve this?
Thanks in advance.
I was trying the following command:
df.write.format("kafka")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", EH_SASL)
    .option("kafka.batch.size", 250)
    .option("kafka.bootstrap.servers", "server1:port1,server2:port2,server3:port3")
    .option("kafka.request.timeout.ms", 120000)
    .option("topic", "topic_name")
    .save()
I get the following error if I don't pass any auth properties:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (executor 0): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Topic topic_name not present in metadata after 60000 ms.
But I am unable to find the correct EH_SASL format that includes the client_id and IMS token for auth.

Did you check if there is a need to add kafka.sasl.jaas.config with a username and password?
"kafka.sasl.jaas.config": 'org.apache.kafka.common.security.plain.PlainLoginModule required username="{kafka_username}" password="{kafka_pwd}";'

Related

Databricks Streaming Jobs Mysteriously Terminating after 5 minutes

I'm trying to use Databricks on Azure with a Spark Structured Streaming job and am having a very mysterious issue.
I boiled the job down to its basics for testing: reading from a Kafka topic and writing to the console in a foreachBatch.
On local, everything works fine indefinitely.
On Databricks, the task terminates after just over 5 minutes with a "Cancelled" status.
There are no errors in the log, just the following, which appears to be a graceful shutdown request of some kind, but I don't know where it's coming from:
22/11/04 18:31:30 INFO DriverCorral$: Cleaning the wrapper ReplId-1ea30-8e4c0-48422-a (currently in status Running(ReplId-1ea30-8e4c0-48422-a,ExecutionId(job-774316032912321-run-84401-action-5645198327600153),RunnableCommandId(9102993760433650959)))
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverLocal: cancelled jobGroup:2207618020913201706_9102993760433650959_job-774316032912321-run-84401-action-5645198327600153
22/11/04 18:31:30 INFO ScalaDriverWrapper: Stopping streams for commandId pattern: CommandIdPattern(2207618020913201706,None,Some(job-774316032912321-run-84401-action-5645198327600153)).
22/11/04 18:31:30 INFO DatabricksStreamingQueryListener: Stopping the stream [id=d41eff2a-4de6-4f17-8d1c-659d1c1b8d98, runId=5bae9fb4-b5e1-45a0-af1e-a2f2553592c9]
22/11/04 18:31:30 INFO DAGScheduler: Asked to cancel job group 5bae9fb4-b5e1-45a0-af1e-a2f2553592c9
22/11/04 18:31:30 INFO TaskSchedulerImpl: Cancelling stage 366
22/11/04 18:31:30 INFO TaskSchedulerImpl: Killing all running tasks in stage 366: Stage cancelled
22/11/04 18:31:30 INFO MicroBatchExecution: QueryExecutionThread.interruptAndAwaitExecutionThreadTermination called with streaming query exit timeout=15000 ms
For reference, here is the code:
val incomingStream = spark.readStream
  .format("kafka")
  .option("subscribe", ehName)
  .option("kafka.bootstrap.servers", topicUriWithPort)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.jaas.config", jaas)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 1) // todo: make config
  .load()

val processedWriteStream = incomingStream
  .writeStream
  .queryName("query2")
  .foreachBatch((d: DataFrame, b: Long) => {
    d.show()
  })
  .start()

processedWriteStream.awaitTermination()
Structured Streaming provides fault-tolerance and data consistency for streaming queries; using Databricks workflows, you can easily configure your Structured Streaming queries to restart on failure automatically.
You can restart the query after a failure by enabling checkpointing for a streaming query.
The restarted query continues where the failed one left off.
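For reference, a minimal sketch of what enabling checkpointing looks like on the write side, shown in PySpark rather than the Scala above; the checkpoint path is a placeholder, and eh_name / topic_uri_with_port mirror the variables in the question:

incoming_df = (spark.readStream
    .format("kafka")
    .option("subscribe", eh_name)
    .option("kafka.bootstrap.servers", topic_uri_with_port)
    .option("startingOffsets", "earliest")
    .load())

query = (incoming_df.writeStream
    .queryName("query2")
    .option("checkpointLocation", "dbfs:/tmp/checkpoints/query2")  # placeholder; use a durable path
    .foreachBatch(lambda batch_df, batch_id: batch_df.show())
    .start())

query.awaitTermination()

With a checkpoint location set, a restarted run (for example via a Databricks workflow retry) resumes from the last committed offsets instead of starting over.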

Delta lake error on DeltaTable.forName in k8s cluster mode cannot assign instance of java.lang.invoke.SerializedLambda

I am trying to merge some data into a Delta table in a streaming application on k8s, using spark-submit in cluster mode.
I am getting the error below, but it works fine in k8s local mode and on my laptop; none of the Delta Lake operations work in k8s cluster mode.
Below are the library versions I am using. Is it some compatibility issue?
SPARK_VERSION_DEFAULT=3.3.0
HADOOP_VERSION_DEFAULT=3
HADOOP_AWS_VERSION_DEFAULT=3.3.1
AWS_SDK_BUNDLE_VERSION_DEFAULT=1.11.974
Below is the error message:
py4j.protocol.Py4JJavaError: An error occurred while calling o128.saveAsTable. : java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4) (192.168.15.250 executor 2): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF
Finally able to resolve this issue. The problem was that, for some reason, dependent jars like Delta and Kafka were not available on the executors, as per the SO response below:
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD
I added the jars to the spark/jars folder via the Docker image and the issue was resolved.
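As an alternative to baking the jars into the image, Spark can resolve and ship the dependencies itself via spark.jars.packages. A minimal PySpark sketch; the package coordinates below are assumptions and must be matched to your Spark/Scala versions (Spark 3.3.0 / Scala 2.12 here):

from pyspark.sql import SparkSession

# Sketch only: verify the Delta and Kafka artifact versions against your Spark build.
spark = (SparkSession.builder
    .appName("delta-streaming")
    .config("spark.jars.packages",
            "io.delta:delta-core_2.12:2.1.0,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

The same coordinates can equally be passed as --packages on spark-submit, which is often more reliable in k8s cluster mode since both the driver and the executors resolve them at launch.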

Error reading from large Cassandra table with Spark, getting "Remote RPC client disassociated"

I set up a standalone Spark cluster (with Cassandra), but when I read data I get an error. My cluster has 3 nodes, and each node has 64 GB of RAM and 20 cores. Some of my spark-env.sh configuration: spark_executor_cores: 5, spark_executor_memory: 5G, spark_worker_cores: 20 and spark_worker_memory: 45g.
One more piece of information: when I read a small table there is no problem, but when I read a big table I get the error. The error description is below. Also, when I start pyspark I use this command:
$ ./pyspark --master spark://10.0.0.100:7077
--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
--conf spark.driver.extraJavaOptions=-Xss1024m
--conf spark.driver.port:36605
--conf spark.driver.blockManager.port=42365
Thanks for your interest
ERROR TaskSchedulerImpl: Lost executor 5 on 10.0.0.10: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.0.0.10 executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated.
WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) (10.0.0.11 executor 2):Java.lang.StackOverflowError
at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1094)
at java.base/java.nio.HeapByteBuffer.get(HeapByteBuffer.java:184)
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:49)
at java.base/java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2887)
at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2903)
at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3678)
at java.base/java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:3678)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2058)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1663)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2490)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2384)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1681)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2490)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2384)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1681)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2490)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2384)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1681)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2490)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2384)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2222)
The problem you're running into is most likely a networking issue.
It's highly unusual that you need to pin the driver ports with:
--conf spark.driver.port:36605
--conf spark.driver.blockManager.port=42365
You'll need to provide background information on why you're doing this.
Also as I previously advised you on another question last week, you need to provide the minimal code + minimal configuration that replicates the problem. Otherwise, there isn't enough information for others to be able to help you. Cheers!

Job aborted when writing table using different cluster on Databricks

I have two clusters on Databricks, and I used one (cluster1) to write a table to the datalake. I need to use the other cluster (cluster2) to schedule the job in charge of writing this table. However, this error occurs:
Py4JJavaError: An error occurred while calling o344.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 3740.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3740.0 (TID
113976, 10.246.144.215, executor 13): org.apache.hadoop.security.AccessControlException:
CREATE failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the
resource does not exist or the user is not authorized to perform the requested operation.).
[7974c88e-0300-4e1b-8f07-a635ad8637fb] failed with error 0x83090aa2 (Forbidden.
ACL verification failed. Either the resource does not exist or the user is not authorized
to perform the requested operation.).
From the "Caused by" message it seems that I do not have the authorization to write on the datalake, but if i change the table name it successfully write the df onto the datalake.
I am trying to write the table with the following command:
df.write \
    .format('delta') \
    .mode('overwrite') \
    .option('path', path) \
    .option('overwriteSchema', "true") \
    .saveAsTable(table_name)
I tried to drop the table and rewrite it using cluster2, but this doesn't work; it is as if the location on the datalake is already occupied, and only with cluster1 can I write to that location.
In the past I simply changed the table name as a workaround, but this time I need to keep the old name.
How can I solve this? Why is the datalake tied to the cluster with which I wrote the table?
The issue was caused by the different Service Principals used by the two clusters.
To solve the problem I had to drop the table and remove the path in the datalake with cluster1. Then, I could write the table again using cluster2.
The command to delete the path is:
rm -r 'adl://path/to/table'
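In a Databricks notebook this removal is typically done through dbutils.fs (or the %fs magic); a minimal sketch using the same placeholder path:

# Recursively remove the old table location before rewriting it from cluster2.
dbutils.fs.rm("adl://path/to/table", recurse=True)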

How can I connect Azure Databricks to Cosmos DB using MongoDB API?

I have created an Azure Cosmos DB account using the MongoDB API. I need to connect Cosmos DB (MongoDB API) to an Azure Databricks cluster in order to read and write data from Cosmos.
How do I connect an Azure Databricks cluster to the Cosmos DB account?
Here is the piece of pyspark code I use to connect to a Cosmos DB database using the MongoDB API from Azure Databricks (runtime 5.2 ML Beta, which includes Apache Spark 2.4.0 and Scala 2.11, with the MongoDB connector org.mongodb.spark:mongo-spark-connector_2.11:2.4.0):
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .getOrCreate()

df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", CONNECTION_STRING) \
    .load()
With a CONNECTION_STRING that looks like that:
"mongodb://USERNAME:PASSWORD#testgp.documents.azure.com:10255/DATABASE_NAME.COLLECTION_NAME?ssl=true&replicaSet=globaldb"
I tried a lot of other options (adding the database and collection names as options or as config on the SparkSession) without success.
Tell me if it works for you...
After adding the org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 package, this worked for me:
import json

query = {
    '$limit': 100,
}

query_config = {
    'uri': 'myConnectionString',
    'database': 'myDatabase',
    'collection': 'myCollection',
    'pipeline': json.dumps(query),
}

df = spark.read.format("com.mongodb.spark.sql") \
    .options(**query_config) \
    .load()
I do, however, get this error with some collections:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 10.139.64.6, executor 0): com.mongodb.MongoInternalException: The reply message length 10168676 is less than the maximum message length 4194304
Answering the same way I did on my own question.
Using Maven as the source, I installed the right library on my cluster using the coordinate
org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
for Spark 2.4.
An example of the code I used is as follows (for those who want to try):
# Read Configuration
readConfig = {
    "URI": "<URI>",
    "Database": "<database>",
    "Collection": "<collection>",
    "ReadingBatchSize": "<batchSize>"
}

pipelineAccounts = "{'$sort' : {'account_contact': 1}}"

# Connect via azure-cosmosdb-spark to create Spark DataFrame
accountsTest = (spark.read
    .format("com.mongodb.spark.sql")
    .options(**readConfig)
    .option("pipeline", pipelineAccounts)
    .load())

accountsTest.select("account_id").show()
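Since the question also asks about writing back to Cosmos DB: with the same mongo-spark-connector_2.11:2.4.0 package, a write would look roughly like the sketch below. The config keys and values are placeholders mirroring the readConfig above, not verified against a live account:

# Write configuration: placeholder values, same shape as the readConfig above.
writeConfig = {
    "uri": "<URI>",
    "database": "<database>",
    "collection": "<collection>"
}

(accountsTest.write
    .format("com.mongodb.spark.sql")
    .options(**writeConfig)
    .mode("append")
    .save())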
