Azure databricks avro integration on Azure Schema registry - azure

I would like to be able to write a pyspark dataframe to kafka as avro, with automatically registering schema with the Azure schema registry in EventHub.
This should be possible according to docs e.g. https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/avro-dataframe
However the code below fails with
TypeError: to_avro() takes from 1 to 2 positional arguments but 3 were given
So I am probably not importing the right to_avro method? My environment is Azure Databricks - DBR 11.3 LTS
from pyspark.sql.functions import col, lit
from pyspark.sql.avro.functions import from_avro, to_avro
schema_registry_addr = "https://someeventhub.servicebus.windows.net:8081"
df \
.select(
to_avro(col("key"), lit("t-key"),
schema_registry_addr).alias("key"),
to_avro(col("value"), lit("t-value"), schema_registry_addr).alias("value")) \
.writeStream \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("topic", TOPIC) \
.option("checkpointLocation", "/FileStore/checkpoint") \
.start()```

Related

Reading from Azure Event hub with Kafka driver doesn't seem to get any data

I'm running the following code in an Azure Databricks python notebook:
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhubns.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhubns.servicebus.windows.net/;SharedAccessKeyName=MyKeyName;SharedAccessKey=myaccesskey;\";"
df = spark.readStream \
.format("kafka") \
.option("subscribe", TOPIC) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.request.timeout.ms", "60000") \
.option("kafka.session.timeout.ms", "60000") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.load()
df_write = df.writeStream \
.outputMode("append") \
.format("console") \
.start() \
.awaitTermination()
This shows no output in the notebook. How could I debug what the problem is?
If you use .format("console") then output won't be in the notebook, it will be in the driver & executor logs - it's a difference between Spark and Databricks.
If you want to see the data, just use the display function:
display(df)
This code is now writing data with quite low latency. Newest datapoint is around 10 seconds old when I do a select in a sql warehouse. The problem still is that foreachBatch is not run, but otherwise it's working.
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhub.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=mykeyname;SharedAccessKey=mykey;EntityPath=myentitypath;\";"
df = spark.readStream \
.format("kafka") \
.option("subscribe", TOPIC) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.request.timeout.ms", "60000") \
.option("kafka.session.timeout.ms", "60000") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.load()
n = 100
count = 0
def run_command(batchDF, epoch_id):
global count
count += 1
if count % n == 0:
spark.sql("OPTIMIZE firstcatalog.bronze.factorydatas3 ZORDER BY (readtimestamp)")
...Omitted code where I transform the data in the value column to strongly typed data...
myTypedDF.writeStream \
.foreachBatch(run_command) \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
.partitionBy("somecolumn") \
.toTable("myunitycatalog.bronze.mytable")

Unable to Connect to Apache Spark Kafka Streaming Using SSL on EMR Notebooks

I'm trying to connect to a Kafka consumer secured by SSL using spark structured streaming but I am having issues. I have the Kafka consumer working and reading events using confluent_kafka with the following code:
conf = {'bootstrap.servers': 'url:port',
'group.id': 'group1',
'enable.auto.commit': False,
'security.protocol': 'SSL',
'ssl.key.location': 'file1.key',
'ssl.ca.location': 'file2.pem',
'ssl.certificate.location': 'file3.cert',
'auto.offset.reset': 'earliest'
}
consumer = Consumer(conf)
consumer.subscribe(['my_topic'])
# Reads events without issues
msg = consumer.poll(timeout=0)
I'm having issues replicating this code with Spark Structured Streaming on EMR Notebooks.
This is the current setup I have on EMR Notebooks:
%%configure -f
{
"conf": {
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
"livy.rsc.server.connect.timeout":"600s",
"spark.pyspark.python": "python3",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}
}
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark import SparkFiles
cert_file = 'file3.cert'
pem_file = 'file2.pem'
key_file = 'file3.key'
sc.addFile(f's3://.../{cert_file}')
sc.addFile(f's3://.../{pem_file}')
sc.addFile(f's3://.../{key_file}')
spark = SparkSession\
.builder \
.getOrCreate()
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "url:port") \
.option("kafka.group.id", "group1") \
.option("enable.auto.commit", "False") \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.key.location", SparkFiles.get(key_file)) \ # SparkFiles.get() works
.option("kafka.ssl.ca.location", SparkFiles.get(pem_file)) \
.option("kafka.ssl.certificate.location", SparkFiles.get(cert_file)) \
.option("startingOffsets", "earliest") \
.option("subscribe", "my_topic") \
.load()
query = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
And no rows appear in the Structured Streaming tab in the spark UI even though I expect the rows to show up instantly since I am using the earliest startingOffsets.
My hypothesis is that readStream doesn't work because the SSL information is not set up correctly. I've looked and haven't found .option() parameters that directly correspond to the confluent_kafka API.
Any help would be appreciated.

spark streaming for python not working in databricks

I am trying to read from a confluent topic via spark streaming with python in databricks.
So i have 2 questions
I tried to read from a topic but it keeps giving me a "failed to construct kafka consumer"
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-xxxxxxxxx.confluent.cloud:9092") \
.option("subscribe", "topic1") \
.option("kafka.sasl.mechanisms", "PLAIN")\
.option("kafka.security.protocol", "SASL_SSL")\
.option("kafka.sasl.username","xxxx")\
.option("kafka.sasl.password", "xxxx")\
.option("startingOffsets", "earliest")\
.option("failOnDataLoss", "false")\
.load()\
.select('topic', 'partition', 'offset', 'timestamp', 'timestampType', 'key')
I then tried to do a
display(df);
and i keep getting a
kafkashaded.org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
is there something im missing? Im trying to see my dataframe which im trying to fetch from my confluent topic
how do i enable spark stream to be listening to my topics continuously in databricks? On my laptop I can do a spark submit to a cluster but im not very sure in databricks.
Any help is appreciated!
Thanks.
to those who are wondering why
df = (spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "pkc-xxxx:9092")
.option("subscribe", "GOOG_GLOBAL_MOBILITY_REPORT_RAW")
.option("kafka.security.protocol","SASL_SSL")
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxxx" password="xxxxx";""")
.load()
.withColumn('key', fn.col("key").cast(StringType()))
.withColumn('value', fn.col("value").cast(StringType()))
)

Spark Structural Streaming with Confluent Cloud Kafka connectivity issue

I am writing a Spark structured streaming application in PySpark to read data from Kafka in Confluent Cloud. The documentation for the spark readstream() function is too shallow and didn't specify much on the optional parameter part especially on the auth mechanism part. I am not sure what parameter goes wrong and crash the connectivity. Can anyone have experience in Spark help me to start this connection?
Required Parameter
> Consumer({'bootstrap.servers':
> 'cluster.gcp.confluent.cloud:9092',
> 'sasl.username':'xxx',
> 'sasl.password': 'xxx',
> 'sasl.mechanisms': 'PLAIN',
> 'security.protocol': 'SASL_SSL',
> 'group.id': 'python_example_group_1',
> 'auto.offset.reset': 'earliest' })
Here is my pyspark code:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "cluster.gcp.confluent.cloud:9092") \
.option("subscribe", "test-topic") \
.option("kafka.sasl.mechanisms", "PLAIN")\
.option("kafka.security.protocol", "SASL_SSL")\
.option("kafka.sasl.username","xxx")\
.option("kafka.sasl.password", "xxx")\
.option("startingOffsets", "latest")\
.option("kafka.group.id", "python_example_group_1")\
.load()
display(df)
However, I keep getting an error:
kafkashaded.org.apache.kafka.common.KafkaException: Failed to
construct kafka consumer
DataBrick Notebook- for testing
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4673082066872014/3543014086288496/1802788104169533/latest.html
Documentation
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/structured-streaming-kafka-integration.html
This error indicates that JAAS configuration is not visible to your Kafka consumer. To solve this issue include JASS based on the follow steps:
Step01: Create a file for below JAAS file : /home/jass/path
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
Step02: Call that JASS file path in spark-submit based on the below conf parameter .
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path"
Full spark-submit command :
/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.spark:spark-avro_2.11:2.4.0,org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 --conf spark.ui.port=4055 --files /home/jass/path,/home/bdpda/bdpda.headless.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/jass/path" pysparkstructurestreaming.py
Pyspark Structured streaming sample code :
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
import time
# Spark Streaming context :
spark = SparkSession.builder.appName('PythonStreamingDirectKafkaWordCount').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topic_name"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'kafka_server:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"Clinet_id")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/home/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "password_rd") \
.option("kafka.sasl.kerberos.keytab","/home/path.keytab") \
.option("kafka.sasl.kerberos.principal","path") \
.load()
df1 = df.selectExpr( "CAST(value AS STRING)")
# Creating Writestream DataFrame :
df1.writeStream \
.option("path","target_directory") \
.format("csv") \
.option("checkpointLocation","chkpint_directory") \
.outputMode("append") \
.start()
ssc.awaitTermination()
We need to specified kafka.sasl.jaas.config to add the username and password for the Confluent Kafka SASL-SSL auth method. Its parameter looks a bit odd, but it's working.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-43n10.us-central1.gcp.confluent.cloud:9092") \
.option("subscribe", "wallet_txn_log") \
.option("startingOffsets", "earliest") \
.option("kafka.security.protocol","SASL_SSL") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="xxx" password="xxx";""").load()
display(df)

How to format a pyspark connection string for Azure Eventhub with Kafka

I am trying to parse JSON messages with Pyspark from an Azure Eventhub with enabled Kafka compatibility. I can't find any documentation on how to establish the connection.
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.stop() # Jupyter somehow created a context already..
sc = SparkContext(appName="PythonTest")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
# my connection string:
#Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=examplekeyname;SharedAccessKey=HERETHEJEY=;EntityPath=examplepathname - has a total of 5 partitions
kafkaStream = KafkaUtils.createStream(HOW DO I STRUCTURE THIS??)
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.count().map(lambda x:'Messages in this batch: %s' % x).pprint()
ssc.start()
ssc.awaitTermination()
See my answer (and question) here. That was for how to write to an Kafka-enabled Event Hub in pyspark but I assume reading config should be pretty similar. The tricky part was to get the security configuration right.
EH_SASL = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
// Source: https://github.com/Azure/azure-event-hubs-for-kafka/tree/master/tutorials/spark#running-spark
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.batch.size", 5000) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("kafka.request.timeout.ms", 120000) \
.option("topic", "raw") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()
You can find any official tutorial on how to set up a consumer here. It's for Scala instead of PySpark but it's fairly easy to transform the code if you compare it with my example.

Resources