Reading from Azure Event Hub with the Kafka driver doesn't seem to get any data - apache-spark

I'm running the following code in an Azure Databricks Python notebook:
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhubns.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhubns.servicebus.windows.net/;SharedAccessKeyName=MyKeyName;SharedAccessKey=myaccesskey;\";"
df = spark.readStream \
.format("kafka") \
.option("subscribe", TOPIC) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.request.timeout.ms", "60000") \
.option("kafka.session.timeout.ms", "60000") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.load()
df_write = df.writeStream \
.outputMode("append") \
.format("console") \
.start() \
.awaitTermination()
This shows no output in the notebook. How could I debug what the problem is?

If you use .format("console"), the output won't appear in the notebook; it goes to the driver & executor logs. That's a difference between plain Spark and Databricks.
If you want to see the data, just use the display function:
display(df)
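If display isn't available (outside Databricks), one alternative sketch for peeking at the stream from the notebook is a memory sink that you can query with SQL; the query name debug_stream below is just an illustrative choice, not something from the question:
# Minimal sketch: route the stream into an in-memory table and query it with SQL.
# "debug_stream" is a hypothetical query/table name used only for illustration.
q = df.writeStream \
    .format("memory") \
    .queryName("debug_stream") \
    .outputMode("append") \
    .start()

spark.sql("SELECT * FROM debug_stream LIMIT 10").show()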

This code is now writing data with quite low latency: the newest data point is around 10 seconds old when I run a SELECT in a SQL warehouse. The remaining problem is that foreachBatch is not run, but otherwise it's working.
TOPIC = "myeventhub"
BOOTSTRAP_SERVERS = "myeventhub.servicebus.windows.net:9093"
EH_SASL = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=mykeyname;SharedAccessKey=mykey;EntityPath=myentitypath;\";"
df = spark.readStream \
.format("kafka") \
.option("subscribe", TOPIC) \
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.request.timeout.ms", "60000") \
.option("kafka.session.timeout.ms", "60000") \
.option("failOnDataLoss", "false") \
.option("startingOffsets", "earliest") \
.load()
n = 100
count = 0
def run_command(batchDF, epoch_id):
    global count
    count += 1
    if count % n == 0:
        spark.sql("OPTIMIZE firstcatalog.bronze.factorydatas3 ZORDER BY (readtimestamp)")
...Omitted code where I transform the data in the value column to strongly typed data...
myTypedDF.writeStream \
.foreachBatch(run_command) \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/delta/events/_checkpoints/") \
.partitionBy("somecolumn") \
.toTable("myunitycatalog.bronze.mytable")

Related

PySpark: are different transformations on a Kafka-read dataframe working on the same data?

I am new to Spark and I have a question about the load behavior when dealing with the Kafka batch API: suppose I am reading the dataframe in this way:
df = spark.read \
.format("kafka") \
.option("subscribe", topic_name) \ # many other options follows
.load()
I want to perform different transformations on it: for instance, extracting the last offset read from each partition, and manipulating the data to create some parquet files:
aggregates = df.groupBy("partition").max("offset").collect()
# some filtering on df and finally
df.write.parquet(target_path)
My understanding is that the data will actually be read just once when load is called, and that the aggregate and the write will then work on the same data. If new events are pushed to Kafka during the processing, they will be ignored.
Is this correct, or will the collect and write.parquet actually re-trigger the loading, so that they can see different sets of Kafka events?
I spent some time trying it out to see what happens. Here is the Databricks notebook I used to test:
# Databricks notebook source
# MAGIC %md
# MAGIC Let's read some data from the real topic and put 1000 lines in the topic_test, to play with it
# COMMAND ----------
truststore = '****'
trustStorePasword = '****'
source_data = spark.read \
.format("kafka") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password", trustStorePasword) \
.option("kafka.bootstrap.servers", "****") \
.option("kafka.security.protocol", "ssl") \
.option("kafka.ssl.truststore.type", "PKCS12") \
.option("subscribe", "sometopic") \
.load()
# COMMAND ----------
source_data \
.limit(1000) \
.select("key","value") \
.write \
.format("kafka") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password", trustStorePasword) \
.option("kafka.bootstrap.servers", "****") \
.option("kafka.security.protocol", "ssl") \
.option("kafka.ssl.truststore.type", "PKCS12") \
.option("topic", "topic_test") \
.save()
# COMMAND ----------
# MAGIC %md
# MAGIC Now let's read from topic_test and compute the aggregate
# COMMAND ----------
df = spark.read \
.format("kafka") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password", trustStorePasword) \
.option("kafka.bootstrap.servers", "****") \
.option("kafka.security.protocol", "ssl") \
.option("kafka.ssl.truststore.type", "PKCS12") \
.option("subscribe", "topic_test") \
.load()
# COMMAND ----------
display(df.groupBy("partition").max("offset").collect())
# COMMAND ----------
df.count() # shows 1000
# COMMAND ----------
# MAGIC %md
# MAGIC Let's push some more data to topic_test - we will see if the count changes
# COMMAND ----------
source_data \
.limit(100) \
.select("key","value") \
.write \
.format("kafka") \
.option("kafka.ssl.truststore.location", truststore) \
.option("kafka.ssl.truststore.password", trustStorePasword) \
.option("kafka.bootstrap.servers", "****") \
.option("kafka.security.protocol", "ssl") \
.option("kafka.ssl.truststore.type", "PKCS12") \
.option("topic", "topic_test") \
.save()
# COMMAND ----------
# MAGIC %md
# MAGIC count should be 1000
# COMMAND ----------
df.count() # it's 1100
So it does read the data again, and the two operations can see different things. I tried again, adding .cache() after the first load, and with that the count did not change.
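For reference, a sketch of the cached variant (SSL options omitted; caching is best-effort, so evicted partitions would be recomputed from Kafka):
df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "****") \
    .option("subscribe", "topic_test") \
    .load() \
    .cache()

df.count()  # the first action materializes the cache; later actions reuse the cached rows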

Spark streaming: get the max values

Hi, I am trying to get the most repeated values from streaming data.
In order to do this I have the following code:
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import regexp_extract, col
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 pyspark-shell'
spark = SparkSession \
.builder \
.appName("SSKafka") \
.getOrCreate()
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", 'localhost:9092') \
.option("subscribe", 'twitter') \
.option("startingTimestamp", 1000) \
.option("startingOffsets", "earliest") \
.load()
ds = df \
.selectExpr("CAST(value AS STRING)", "timestamp") \
.select(regexp_extract(col('value'), '#(\w+)', 1).alias('hashtags'), 'timestamp')
df_group = ds.withWatermark("timestamp", "5 seconds") \
.groupBy(
'timestamp',
'hashtags'
).agg(
F.count(col('hashtags')).alias('total')
)
query = df_group \
.writeStream \
.outputMode("append") \
.format("console") \
.option("truncate", "False") \
.start()
query.awaitTermination()
The idea is to process a batch every 5 seconds and, when each batch is processed, show the hashtags that are currently used the most.
The original idea was to use this code without grouping by timestamp, but I got an error saying that if ds doesn't use timestamp then df_group can't use outputMode("append"), and I want to show the updates.
Is this possible, and how can I do it?
Thanks.
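One common pattern, sketched below under the assumption that per-window counts are enough (a full top-N ranking would need complete output mode, since sorting a streaming aggregate isn't allowed in append mode), is to group by a 5-second window of the event timestamp instead of the raw timestamp; with a watermark, append mode then emits each window's counts once the window closes:
df_group = ds \
    .withWatermark("timestamp", "5 seconds") \
    .groupBy(F.window("timestamp", "5 seconds"), "hashtags") \
    .agg(F.count(col("hashtags")).alias("total"))

query = df_group.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "False") \
    .start()
query.awaitTermination()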

Why can't PySpark show any data?

When I use local Spark on Windows as below, it works and I can see df.count():
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, col
spark = SparkSession \
.builder \
.appName("Structured Streaming ") \
.master("local[*]") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
.option("subscribe", kafka_topic_name) \
.option("startingOffsets", "latest") \
.load()
flower_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")
flower_schema_string = "sepal_length DOUBLE,sepal_width DOUBLE,petal_length DOUBLE,petal_width DOUBLE,species STRING"
flower_df2 = flower_df1.select(from_csv(col("value"), flower_schema_string).alias("flower"), "timestamp").select("flower.*", "timestamp")
flower_df2.createOrReplaceTempView("flower_find")
song_find_text = spark.sql("SELECT * FROM flower_find")
flower_agg_write_stream = song_find_text \
.writeStream \
.option("truncate", "false") \
.format("memory") \
.outputMode("update") \
.queryName("testedTable") \
.start()
while True:
    df = spark.sql("SELECT * FROM testedTable")
    print(df.count())
    time.sleep(1)
But when I use Spark on my VirtualBox Ubuntu machine, I NEVER SEE any data.
Below are the modifications I made when using Ubuntu's Spark:
Changed the SparkSession's master URL to "spark://192.168.15.2:7077"
Inserted flower_agg_write_stream.awaitTermination() above "while True:"
Did I do something wrong?
UPDATE:
When I run the modified code, the log shows:
...
org.apache.spark.sql.AnalysisException: Table or view not found: testedTable;
...
Unfortunately, I already tried createOrReplaceGlobalTempView(), but it doesn't work either.
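Two things worth checking here (sketched with the names from the snippet above, as an assumption rather than a verified fix): the memory sink registers testedTable only in the session that starts the query, and awaitTermination() placed before the loop blocks forever, so the loop is never reached. Polling while the query is active and awaiting termination afterwards could look like this:
flower_agg_write_stream = song_find_text \
    .writeStream \
    .format("memory") \
    .outputMode("append") \
    .queryName("testedTable") \
    .start()

# poll the in-memory table while the streaming query is running
# (append output mode here is a choice; the memory sink keeps the table on the driver)
while flower_agg_write_stream.isActive:
    print(spark.sql("SELECT * FROM testedTable").count())
    time.sleep(1)

flower_agg_write_stream.awaitTermination()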

Spark Structured Streaming PySpark CSV Sink Doesn't Append

I write JSON to a Kafka topic and read JSON from the Kafka topic. Currently I subscribe to the topic and write to the console line by line, but I need to sink/write to a CSV file instead, and I can't: the CSV is written once but it doesn't append.
You can see my code below.
Thank you!
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as func
spark = SparkSession.builder\
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0') \
.appName('kafka_stream_test')\
.getOrCreate()
ordersSchema = StructType() \
.add("a", StringType()) \
.add("b", StringType()) \
.add("c", StringType()) \
.add("d", StringType())\
.add("e", StringType())\
.add("f", StringType())
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "product-views") \
.load()
df_query = df \
.selectExpr("cast(value as string)") \
.select(func.from_json(func.col("value").cast("string"),ordersSchema).alias("parsed"))\
.select("parsed.a","parsed.b","parsed.c","parsed.d","parsed.e","parsed.f")\
df = df_query \
.writeStream \
.format("csv")\
.trigger(processingTime = "5 seconds")\
.option("path", "/var/kafka_stream_test_out/")\
.option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
.start()
df.awaitTermination()
Yes, because you need this extra option, .option("format", "append"):
aa = df_query \
.writeStream \
.format("csv")\
.option("format", "append")\
.trigger(processingTime = "5 seconds")\
.option("path", "/var/kafka_stream_test_out/")\
.option("checkpointLocation", "/user/kafka_stream_test_out/chk") \
.outputMode("append") \
.start()

Spark Structured Streaming window issue

I have a problem regarding windows in Spark Structured Streaming. I want to group the data I'm receiving continuously from a Kafka source into sliding windows and count the records. The issue is that writeStream streams the window dataframe each time new data arrives and updates the count of the current window.
I'm using the following code to create the window:
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, current_timestamp, to_json, struct
from pyspark.sql.types import StructType, StructField, StringType
#Define schema of the topic to be consumed
jsonSchema = StructType([ StructField("State", StringType(), True) \
, StructField("Value", StringType(), True) \
, StructField("SourceTimestamp", StringType(), True) \
, StructField("Tag", StringType(), True)
])
spark = SparkSession \
.builder \
.appName("StructuredStreaming") \
.config("spark.default.parallelism", "100") \
.getOrCreate()
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "10.129.140.23:9092") \
.option("subscribe", "SIMULATOR.SUPERMAN.TOTO") \
.load() \
.select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
.select("data.*")
df = df.withColumn("time", current_timestamp())
Window = df \
.withColumn("window",window("time","4 seconds","1 seconds")).groupBy("window").count() \
.withColumn("time", current_timestamp())
#Write back to kafka
query = Window.select(to_json(struct("count","window","time")).alias("value")) \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "10.129.140.23:9092") \
.outputMode("update") \
.option("topic", "structed") \
.option("checkpointLocation", "/home/superman/notebook/checkpoint") \
.start()
The windows are not sorted and are updated each time the count changes. How can we wait for the end of a window and stream its final count only once? Instead of this output:
{"count":21,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":19,"window":{"start":"2019-05-13T09:39:18.000Z","end":"2019-05-13T09:39:22.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":37,"window":{"start":"2019-05-13T09:39:19.000Z","end":"2019-05-13T09:39:23.000Z"},"time":"2019-05-13T09:39:21.939Z"}
{"count":18,"window":{"start":"2019-05-13T09:39:21.000Z","end":"2019-05-13T09:39:25.000Z"},"time":"2019-05-13T09:39:21.939Z"}
I would like this:
{"count":47,"window":{"start":"2019-05-13T09:39:12.000Z","end":"2019-05-13T09:39:16.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":21,"window":{"start":"2019-05-13T09:39:13.000Z","end":"2019-05-13T09:39:17.000Z"},"time":"2019-05-13T09:39:15.026Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:14.000Z","end":"2019-05-13T09:39:18.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":42,"window":{"start":"2019-05-13T09:39:15.000Z","end":"2019-05-13T09:39:19.000Z"},"time":"2019-05-13T09:39:17.460Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:16.000Z","end":"2019-05-13T09:39:20.000Z"},"time":"2019-05-13T09:39:19.818Z"}
{"count":40,"window":{"start":"2019-05-13T09:39:17.000Z","end":"2019-05-13T09:39:21.000Z"},"time":"2019-05-13T09:39:19.818Z"}
The expected output waits for a window to be closed, based on a comparison between the window's end timestamp and the current time.
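One way to get a single final count per window (a sketch based on the code above, not a tested fix) is to declare a watermark on the time column and switch the sink to append output mode; Spark then emits each window only after the watermark has passed the window's end, at the cost of some extra latency:
# df already has the processing-time "time" column added above
windowed = df \
    .withWatermark("time", "1 second") \
    .groupBy(window(col("time"), "4 seconds", "1 second")) \
    .count()

query = windowed \
    .select(to_json(struct("count", "window")).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "10.129.140.23:9092") \
    .option("topic", "structed") \
    .option("checkpointLocation", "/home/superman/notebook/checkpoint") \
    .outputMode("append") \
    .start()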
