Consume events from Event Hub in Azure Databricks using PySpark

I can see Spark connectors and guidelines for consuming events from Event Hub using Scala in Azure Databricks.
But how can we consume Event Hub events from Azure Databricks using PySpark? Any suggestions or documentation pointers would help. Thanks.

Below is a snippet for reading events from Event Hub with PySpark on Azure Databricks.
# Connection string with an entity path
connectionString = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
# Source with default settings
ehConf = {
    'eventhubs.connectionString' : connectionString
}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)

I think a slight modification is required if you are using Spark 2.4.5 or later and version 2.3.15 or above of the Azure Event Hubs connector.
For connector version 2.3.15 and above, the configuration dictionary requires the connection string to be encrypted, so you need to pass it as shown in the code snippet below.
connectionString = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"
ehConf = {}
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readInStreamBody = df.withColumn("body", df["body"].cast("string"))
display(readInStreamBody)
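Beyond the connection string, the same ehConf dictionary takes further connector options, and writing to a sink (rather than display()) keeps the query running. A minimal sketch, assuming the 2.3.15+ connector, with an example consumer group name and checkpoint path:
import json

connectionString = "Endpoint=sb://SAMPLE;SharedAccessKeyName=KEY_NAME;SharedAccessKey=KEY;EntityPath=EVENTHUB_NAME"

ehConf = {}
# connection string must be encrypted for connector 2.3.15+
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
# read with a dedicated consumer group instead of $Default (the name here is an example)
ehConf['eventhubs.consumerGroup'] = "databricks-consumer"
# start from the beginning of the stream
ehConf['eventhubs.startingPosition'] = json.dumps({"offset": "-1", "seqNo": -1, "enqueuedTime": None, "isInclusive": True})

df = spark.readStream.format("eventhubs").options(**ehConf).load()

query = (df.withColumn("body", df["body"].cast("string"))
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/eventhubs-demo-checkpoint")  # example path
    .start())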

Related

Where should I put my credentials when streaming data with Kafka in Databricks?

I have some credential values in Azure Key Vault (AKV).
An initial Google search gave me the following:
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")

from kafka import KafkaConsumer

consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers = 'SERVER:PORT',
                         enable_auto_commit = False,
                         auto_offset_reset = 'earliest',
                         consumer_timeout_ms = 2000,
                         security_protocol = 'SASL_SSL',
                         sasl_mechanism = 'PLAIN',
                         sasl_plain_username = username,
                         sasl_plain_password = pwd)
This works once when the Databricks cell runs; however, after that single run it finishes, it is no longer listening for Kafka messages, and the cluster shuts down after the configured idle time (30 minutes in my case).
So it doesn't solve my problem.
My next Google search led me to this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):
from pyspark.sql.types import *
from pyspark.sql.functions import from_json
from pyspark.sql.functions import *

schema = StructType() \
    .add("EventHeader", StructType() \
        .add("UUID", StringType()) \
        .add("APPLICATION_ID", StringType()) \
        .add("FORMAT", StringType())) \
    .add("EmissionReportMessage", StructType() \
        .add("reportId", StringType()) \
        .add("startDate", StringType()) \
        .add("endDate", StringType()) \
        .add("unitOfMeasure", StringType()) \
        .add("reportLanguage", StringType()) \
        .add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))

parsed_kafka = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))
There are some issues:
1. Where should I put my GenID or user/pass info?
2. When I run the display command, it runs, but it will never stop, and it will never show the result.
however, after a single run it is finished, and it is not listening to Kafka messages anymore
Given that you have enable_auto_commit = False, it should continue to work on following runs. But this isn't using Spark...
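To illustrate that point with plain kafka-python (no Spark involved): iterating the consumer is what actually receives messages, and with consumer_timeout_ms = 2000 the loop, and therefore the cell, ends after two seconds without new messages. A rough sketch using the consumer from the question:
# iterate the consumer; the loop exits after consumer_timeout_ms of inactivity
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)

# since enable_auto_commit = False, commit manually if the next run should
# resume from the last processed offset instead of re-reading from earliest
consumer.commit()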
Where should I put my GenID or user/pass info
You would add SASL/SSL properties into option() parameters.
For example, for SASL_PLAIN:
option("kafka.sasl.jaas.config",
'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, password))
See related question
it will never stop
Because you are running a streaming query (started with readStream) rather than a batch read.
it will never show the result
You'll need to use parsed_kafka.writeStream.format("console").start(), for example, somewhere (assuming you want to stay with readStream rather than display() or a batch read).
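Putting those pieces together, here is a rough sketch of what the Spark version could look like (an illustration, not a verified config; SERVER:PORT and TOPIC are the placeholders from the question, the schema is the one defined above, and on Databricks runtimes the JAAS login module may need the kafkashaded. prefix):
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")

# JAAS configuration built from the secrets (never hard-code credentials)
jaas_config = (
    'org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="{}" password="{}";'.format(username, pwd)
)

parsed_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))

# a streaming query needs a sink; the console sink is enough to see output
query = parsed_kafka.writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()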

In Azure Databricks, writing a PySpark dataframe to Event Hub is taking too long as there are 3 million records in the dataframe

An Oracle database table has 3 million records. I need to read it into a dataframe, convert it to JSON format, and send it to Event Hub for downstream systems.
Below is my PySpark code to connect to and read the Oracle DB table as a dataframe:
df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()
Then I convert the column names and values of each row into JSON (placed under a new column named body) and send that to Event Hub.
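(For reference, a minimal sketch of that conversion step, assuming the body column is built from all source columns with to_json/struct:)
from pyspark.sql.functions import to_json, struct

# pack every column of each row into one JSON string in a column named "body"
df = df.select(to_json(struct(*df.columns)).alias("body"))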
I have defined ehconf with the Event Hub connection string. Below is my code that writes to Event Hub:
df.select("body") \
.write\
.format("eventhubs") \
.options(**ehconf) \
.save()
My PySpark code is taking 8 hours to send the 3 million records to Event Hub.
Could you please suggest how to write a PySpark dataframe to Event Hub faster?
My Event Hub is created under an Event Hubs cluster which has 1 CU of capacity.
Databricks cluster config:
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores
The problem is that the JDBC connector uses just one connection to the database by default, so most of your workers are probably idle. You can confirm that in Cluster Settings > Metrics > Ganglia UI.
To actually make use of all the workers, the JDBC connector needs to know how to parallelize retrieving your data. For this you need a field whose values are evenly distributed. For example, if you have a date field in your data and every date has a similar number of records, you can use it to split up the data:
df = spark.read \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", tableName) \
.option("user", jdbcUsername) \
.option("password", jdbcPassword) \
.option("numPartitions", 64) \
.option("partitionColumn", "<dateField>") \
.option("lowerBound", "2019-01-01") \
.option("upperBound", "2022-04-07") \
.load()
You have to define the field name and the min and max values of that field so that the JDBC connector can try to split the work evenly between the workers. numPartitions is the number of individual connections opened; the best value depends on the number of workers in your cluster and how many connections your data source can handle.
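If the min and max of the partition column are not known up front, they can be fetched with one small bounds query first (a sketch under the same assumptions as the example above, with <dateField> and tableName as placeholders):
# one small query up front to find the bounds of the partition column
bounds = spark.read \
    .format("jdbc") \
    .option("url", jdbcUrl) \
    .option("query", "select min(<dateField>) as lo, max(<dateField>) as hi from " + tableName) \
    .option("user", jdbcUsername) \
    .option("password", jdbcPassword) \
    .load() \
    .collect()[0]
lowerBound, upperBound = str(bounds[0]), str(bounds[1])

# then plug lowerBound/upperBound into the partitioned read shown above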

Databricks on Apache Spark AttributeError: 'str' object has no attribute '_jvm'

When attempting to readStream data from Azure Event Hub with Databricks on Apache Spark, I get the error
AttributeError: 'str' object has no attribute '_jvm'
The details of the error are as follows:
----> 8 ehConf['eventhubs.connectionString'] = sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
The code is as follows:
sparkContext = ""
connectionString = 'Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=both4;SharedAccessKey=adfdMyKeyIGBKYBs=;EntityPath=hubv5'
# Source with default settings
connectionString = connectionString
ehConf = {}
ehConf['eventhubs.connectionString'] = sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
streaming_df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
Has anyone come across this error and found a solution?
It shouldn't be the string variable sparkContext, but the actual SparkContext object sc:
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
P.S. It's actually easier to use the built-in Kafka connector with Event Hubs: you don't need to install anything, and it's more performant...
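For reference, that Kafka-connector route looks roughly like the following sketch (namespace, hub name and key are placeholders; Event Hubs exposes a Kafka-compatible endpoint on port 9093, the Kafka topic name is the event hub name, the full connection string is used as the SASL password, and on Databricks runtimes the shaded PlainLoginModule class name is used):
connectionString = "Endpoint=sb://namespace.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>;EntityPath=hubv5"

# username is the literal string "$ConnectionString"; the password is the connection string itself
eh_sasl = (
    'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
    'username="$ConnectionString" password="{}";'.format(connectionString)
)

streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "namespace.servicebus.windows.net:9093") \
    .option("subscribe", "hubv5") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", eh_sasl) \
    .load()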

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering a problem printing data from a Kafka topic to the console.
The error I get is a NoClassDefFoundError for org/apache/commons/pool2, and after batch 0 the query doesn't process any further.
I don't understand the root cause of these errors. Please help me.
Following are the Kafka and Spark versions:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import from_json, col

spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
#streaming dataframe that reads from kafka topic
df_kafka=spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers",kafka_bootstrap_servers)\
.option("subscribe",kafka_topic_name)\
.option("startingOffsets", "latest") \
.load()
print("Printing schema of df_kafka:")
df_kafka.printSchema()
#converting data from kafka broker to string type
df_kafka_string=df_kafka.selectExpr("CAST(value AS STRING) as value")
# schema to read json format data
ts_schema = StructType() \
.add("id_str", StringType()) \
.add("created_at", StringType()) \
.add("text", StringType())
#parse json data
df_kafka_string_parsed=df_kafka_string.select(from_json(col("value"),ts_schema).alias("twts"))
df_kafka_string_parsed_format=df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()
df=df_kafka_string_parsed_format.writeStream \
.trigger(processingTime="1 seconds") \
.outputMode("update")\
.option("truncate","false")\
.format("console")\
.start()
df.awaitTermination()
The error (NoClassDefFound, followed by the kafka010 package) says that spark-sql-kafka-0-10 is missing its transitive dependency on org.apache.commons:commons-pool2:2.6.2.
You can either download that JAR as well, or change your code to use --packages instead of the spark.jars option and let Ivy handle downloading the transitive dependencies:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.builder...
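For example, with the Spark 3.1.1 / Scala 2.12 build from the question, that could look like the sketch below (the coordinate has to match your own Spark and Scala versions, and the environment variable must be set before the SparkSession is created):
import os

# let Ivy resolve spark-sql-kafka-0-10 and its transitive commons-pool2 dependency
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Pyspark structured streaming with kafka and cassandra") \
    .master("local[*]") \
    .getOrCreate()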

Trying to consume Kafka streams using Spark Structured Streaming

I'm new to Kafka streaming. I set up a Twitter listener using Python and it is producing to the Kafka server running at localhost:9092. I can consume the stream produced by the listener using a Kafka client tool (Conduktor) and also using the command "bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter --from-beginning".
But when I try to consume the same stream using Spark Structured Streaming, it does not capture anything and throws the error: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
[Screenshots: the kafka-console-consumer output shows the data; the Jupyter output for the Spark consumer does not.]
My Producer or listener code:
import tweepy
import pykafka
from tweepy import Stream
from tweepy.streaming import StreamListener

auth = tweepy.OAuthHandler("**********", "*************")
auth.set_access_token("*************", "***********************")
# session.set('request_token', auth.request_token)
api = tweepy.API(auth)

class KafkaPushListener(StreamListener):
    def __init__(self):
        # localhost:9092 = default Kafka broker host and port
        self.client = pykafka.KafkaClient("0.0.0.0:9092")
        # Get a producer for the "twitter" topic
        self.producer = self.client.topics[bytes("twitter", "ascii")].get_producer()

    def on_data(self, data):
        # Producer publishes the data coming from Twitter for the consumer
        self.producer.produce(bytes(data, "ascii"))
        return True

    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, KafkaPushListener())
twitter_stream.filter(track=['#fashion'])
Consumer access from Spark Structured streaming
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "twitter") \
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Found what was missing: when I submitted the Spark job, I had to include the right dependency package version.
I have Spark 3.0.0, so I included the org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 package.
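Another way to attach that same coordinate without touching the submit command is via spark.jars.packages (a sketch; the appName here is arbitrary, and the config only takes effect if it is set before the SparkContext is created):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("twitter-consumer") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0") \
    .getOrCreate()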
Add a sink and it will start consuming data from Kafka.
Check the code below.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "twitter") \
.load()
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
# the console sink is used here; change the format as per your requirement
query.awaitTermination()
