Where should I put my credential data streaming with Kafka in databricks? - apache-spark

I have my Kafka credentials stored in Azure Key Vault (AKV).
An initial search led me to the following:
username = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-api-key")
pwd = dbutils.secrets.get(scope = "DATAAI-CEC", key = "dai-kafka-cec-secret")
from kafka import KafkaConsumer
consumer = KafkaConsumer('TOPIC',
                         bootstrap_servers = 'SERVER:PORT',
                         enable_auto_commit = False,
                         auto_offset_reset = 'earliest',
                         consumer_timeout_ms = 2000,
                         security_protocol = 'SASL_SSL',
                         sasl_mechanism = 'PLAIN',
                         sasl_plain_username = username,
                         sasl_plain_password = pwd)
This works once when the Databricks cell runs. After that single run it finishes and stops listening for Kafka messages, and the cluster shuts down after the configured idle time (30 minutes in my case).
So it doesn't solve my problem.
My next search turned up this Databricks blog post (Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2):
from pyspark.sql.types import *
from pyspark.sql.functions import from_json
from pyspark.sql.functions import *
schema = StructType() \
    .add("EventHeader", StructType() \
        .add("UUID", StringType()) \
        .add("APPLICATION_ID", StringType()) \
        .add("FORMAT", StringType())) \
    .add("EmissionReportMessage", StructType() \
        .add("reportId", StringType()) \
        .add("startDate", StringType()) \
        .add("endDate", StringType()) \
        .add("unitOfMeasure", StringType()) \
        .add("reportLanguage", StringType()) \
        .add("companies", ArrayType(StructType([StructField("ccid", StringType(), True)]))))
parsed_kafka = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "SERVER:PORT") \
    .option("subscribe", "TOPIC") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("kafka_parsed_value"))
There are some issues:
Where should I put my GenID or user/pass info?
When I run the display command, it runs, but it never stops and never shows a result.

"however, after a single run it is finished, and it is not listening to Kafka messages anymore"
Given that you have enable_auto_commit = False, it should continue to work on subsequent runs. But this isn't using Spark...
"Where should I put my GenID or user/pass info"
You would add the SASL/SSL properties as option() parameters.
For example, for SASL PLAIN:
option("kafka.sasl.jaas.config",
       'org.apache.kafka.common.security.plain.PlainLoginModule required username="{}" password="{}";'.format(username, pwd))
See related question
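As a rough sketch of how the SASL settings could be gathered in one place, the helper below builds a dictionary of Kafka source options for SASL_SSL with the PLAIN mechanism. The function name kafka_sasl_plain_options is made up for illustration, and the kafka.security.protocol and kafka.sasl.mechanism entries are assumptions carried over from the KafkaConsumer settings in the question. Some Databricks runtimes reportedly need a kafkashaded. prefix on the login module class name, so treat the class name as something to verify on your cluster.

```python
def kafka_sasl_plain_options(bootstrap_servers, username, password):
    """Return Kafka source options for SASL_SSL with the PLAIN mechanism.

    Hypothetical helper: the option keys are the Kafka client properties
    prefixed with "kafka.", which is how the Spark Kafka source passes
    them through to the consumer.
    """
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        'username="{}" password="{}";'.format(username, password)
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
    }


# Usage in a Databricks notebook, where spark and dbutils already exist:
# options = kafka_sasl_plain_options("SERVER:PORT", username, pwd)
# raw = (spark.readStream.format("kafka")
#        .options(**options)
#        .option("subscribe", "TOPIC")
#        .load())
```

Keeping the credentials in a dictionary built from dbutils.secrets.get values avoids hard-coding them in the notebook.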
"it will never stop"
Because you started a streaming query with readStream rather than a batch read.
"it will never show the result"
You'll need to use parsed_kafka.writeStream.format("console"), for example, somewhere (assuming you want to start with readStream rather than display() and read).
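The two alternatives mentioned here might look roughly like the following. This is only a sketch: start_console_query and read_batch are hypothetical helper names, and the options dictionary is assumed to hold the bootstrap servers and any SASL settings.

```python
def start_console_query(streaming_df):
    """Start a streaming query that prints each micro-batch to the console sink."""
    return (
        streaming_df.writeStream
        .format("console")
        .outputMode("append")
        .option("truncate", "false")  # show full column values
        .start()
    )


def read_batch(spark, options, topic):
    """One-off batch read of what is currently in the topic; this one terminates."""
    return (
        spark.read
        .format("kafka")
        .options(**options)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .option("endingOffsets", "latest")
        .load()
    )


# Usage (assumes spark and parsed_kafka exist as in the question):
# query = start_console_query(parsed_kafka)   # runs until query.stop()
# snapshot = read_batch(spark, {"kafka.bootstrap.servers": "SERVER:PORT"}, "TOPIC")
```

The streaming variant keeps running until it is stopped, while the batch variant returns an ordinary DataFrame that can be shown or displayed once.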

Related

Is there a way to ensure scale of records while streaming from kafka?

I'm new to Spark and Kafka, using pyspark (spark 2.4.8).
Assume we have a kafka streaming source and we want to stream at least N records to our database. What is the best way to ensure the wanted number of records and stop after reaching it?
I thought of counting the number of micro-batches with a global variable and limiting the number of offsets per micro-batch, but I guess that isn't the right way to solve the problem.
My code in general:
raw_stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_server) \
    .option("subscribe", "topic1, topic2") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", offsets_number) \
    .load()
...
# define schema (not relevant)
...
counter = 0
def foreach_batch_function(df, epoch_id):
    global counter
    counter += 1
query = streaming_df \
    .writeStream \
    .outputMode("append") \
    .format("memory") \
    .queryName("query1") \
    .foreachBatch(foreach_batch_function) \
    .start()
But it didn't work. I tried to stop the query after reaching a constant number of micro-batches, but the counter didn't even increase.
Back to my question: what is the right way to pass a lower bound on the number of requested records and then just stop?
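One possible way to approach this, offered only as a sketch and not as the established answer, is to count rows rather than micro-batches inside foreachBatch and to stop the query from the driver once the target is reached. RecordCounter and run_until_target are hypothetical names. Because whole micro-batches are processed, the final count is at least the target rather than exactly the target.

```python
import threading


class RecordCounter:
    """Counts rows seen by foreachBatch and flags when a target is reached."""

    def __init__(self, target):
        self.target = target
        self.seen = 0
        self.done = threading.Event()

    def __call__(self, batch_df, epoch_id):
        # Write batch_df to the database here (omitted), then count its rows.
        # count() triggers an extra computation; persisting batch_df first
        # may be worthwhile if the write is expensive.
        self.seen += batch_df.count()
        if self.seen >= self.target:
            self.done.set()


def run_until_target(streaming_df, counter, poll_seconds=1.0):
    """Start the query, wait until enough rows were seen, then stop it."""
    query = streaming_df.writeStream.foreachBatch(counter).start()
    # Stop from the driver loop instead of from inside the batch function.
    while query.isActive and not counter.done.wait(poll_seconds):
        pass
    query.stop()
    return counter.seen


# Usage (assumes streaming_df exists as in the question):
# total = run_until_target(streaming_df, RecordCounter(target=10000))
```

Combined with maxOffsetsPerTrigger, this bounds how far past the target the stream can run before it is stopped.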

Databricks: Structured Stream fails with TimeoutException

I want to create a structured stream in databricks with a kafka source.
I followed the instructions as described here. My script seems to start, but it fails on the first element of the stream. The stream itself works fine and produces results (in Databricks) when I use confluent_kafka, so there seems to be a different issue that I am missing:
After the initial stream is processed, the script times out:
java.util.concurrent.TimeoutException: Stream Execution thread for stream [id = 80afdeed-9266-4db4-85fa-66ccf261aee4,
runId = b564c626-9c74-42a8-8066-f1f16c7ab53d] failed to stop within 36000 milliseconds (specified by spark.sql.streaming.stopTimeout). See the cause on what was being executed in the streaming query thread.
What I tried: I looked on SO and found this answer, based on which I added
spark.conf.set("spark.sql.streaming.stopTimeout", 36000)
into my setup - which changed nothing.
Any input is highly appreciated!
from pyspark.sql import functions as F
from pyspark.sql.types import *
# Define a data schema
schema = StructType() \
    .add('PARAMETERS_TEXTVALUES_070_VALUES', StringType()) \
    .add('ID', StringType()) \
    .add('PARAMETERS_TEXTVALUES_001_VALUES', StringType()) \
    .add('TIMESTAMP', TimestampType())
df = spark \
    .readStream \
    .format("kafka") \
    .option("host", "stream.xxx.com") \
    .option("port", 12345) \
    .option('kafka.bootstrap.servers', 'stream.xxx.com:12345') \
    .option('subscribe', 'stream_test.json') \
    .option("startingOffsets", "earliest") \
    .load()
df_word = df.select(F.col('key').cast('string'),
                    F.from_json(F.col('value').cast('string'), schema).alias("parsed_value"))
df_word \
    .writeStream \
    .format("parquet") \
    .option("path", "dbfs:/mnt/streamfolder/stream/") \
    .option("checkpointLocation", "dbfs:/mnt/streamfolder/check/") \
    .outputMode("append") \
    .start()
My stream output data looks like this:
"PARAMETERS_TEXTVALUES_070_VALUES":'something'
"ID":"47575963333908"
"PARAMETERS_TEXTVALUES_001_VALUES":12345
"TIMESTAMP": "2020-10-22T15:06:42.507+02:00"
Furthermore, the stream and check folders are filled with 0-byte files, except for the metadata, which includes the id from the error above.
Thanks and stay safe.

How do you call multiple writeStream operations within a single Spark Job?

I am trying to write a Spark Structured Streaming job that reads from a Kafka topic and writes to separate paths (after performing some transformations) via the writeStream operation. However, when I run the following code, only the first writeStream gets executed and the second is getting ignored.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()
write_one = df.writeStream \
    .foreachBatch(lambda x, y: transform_and_write_to_zone_one(x,y)) \
    .start() \
    .awaitTermination()
# transform df to df2
write_two = df2.writeStream \
    .foreachBatch(lambda x, y: transform_and_write_to_zone_two(x,y)) \
    .start() \
    .awaitTermination()
I initially thought that my issue was related to this post, however, after changing my code to the following:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()
write_one = df.writeStream \
    .foreachBatch(lambda x, y: transform_and_write_to_zone_one(x,y)) \
    .start()
# transform df to df2
write_two = df2.writeStream \
    .foreachBatch(lambda x, y: transform_and_write_to_zone_two(x,y)) \
    .start()
write_one.awaitTermination()
write_two.awaitTermination()
I received the following error:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
I am not sure why the additional code between start() and awaitTermination() would cause the error above (but I think this is probably a separate issue that is referenced in this answer to the same post above). What is the correct way to call multiple writeStream operations within the same job? Would it be best to have both of the writes within the function that is invoked by foreachBatch, or is there a better way to achieve this?
The Spark documentation says that if you need to write to multiple locations, you should use the foreachBatch method.
Your code should look something like:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(...).save(...) // location 1
  batchDF.write.format(...).save(...) // location 2
  batchDF.unpersist()
}
Note: persist is needed in order to prevent recomputation.
You can check more: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
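Since the question uses PySpark, the same foreachBatch pattern might be sketched in Python as below. The function name write_to_two_zones, the parquet format, and the two /tmp paths are placeholders chosen for illustration, not values from the question.

```python
def write_to_two_zones(batch_df, batch_id,
                       path_one="/tmp/zone_one", path_two="/tmp/zone_two"):
    """Write one micro-batch to two locations (paths are placeholders)."""
    # Cache the batch so the second write does not recompute it from Kafka.
    batch_df.persist()
    try:
        batch_df.write.mode("append").format("parquet").save(path_one)
        batch_df.write.mode("append").format("parquet").save(path_two)
    finally:
        # Release the cached data even if one of the writes fails.
        batch_df.unpersist()


# Usage (assumes df is the streaming DataFrame from the question):
# query = (df.writeStream
#          .foreachBatch(write_to_two_zones)
#          .option("checkpointLocation", "/tmp/checkpoints/two_zones")
#          .start())
# query.awaitTermination()
```

With this approach there is a single streaming query, so a single awaitTermination() call is enough.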
Don't call awaitTermination() on each of your streaming queries; call it just once through the Spark session instead, e.g.:
spark.streams.awaitAnyTermination()
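Put together, starting both queries first and then blocking once could look like the sketch below. start_both_and_wait and its checkpoint parameters are hypothetical; giving each query its own checkpoint location is an assumption made here so that the two queries track their offsets independently.

```python
def start_both_and_wait(spark, df, df2, write_one_fn, write_two_fn,
                        checkpoint_one, checkpoint_two):
    """Start two streaming queries, then block until either one terminates."""
    q1 = (df.writeStream
          .foreachBatch(write_one_fn)
          .option("checkpointLocation", checkpoint_one)
          .start())
    q2 = (df2.writeStream
          .foreachBatch(write_two_fn)
          .option("checkpointLocation", checkpoint_two)
          .start())
    # Both queries are now running; wait on the session, not on one query.
    spark.streams.awaitAnyTermination()
    return q1, q2
```

Calling start() on both queries before any blocking call is what lets the second query run at all.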

Spark Streaming Kafka - How to stop streaming after processing all existing messages (gracefully)

This is what I am trying to do:
Stream data from a Kafka topic, which keeps receiving data continuously.
Run the job twice a day, to process all data existing at that point and then stop the stream.
So initially I called stop on the query, but it was throwing a "TimeoutException".
Then I tried increasing the timeout dynamically, but now I am getting java.io.IOException: Caused by: java.lang.InterruptedException
So, is there any way to gracefully stop the stream without getting any exceptions?
Below is my current code (part), which is throwing the interrupted exception:
import os
import time

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"])
    .option("subscribe", config.kafka.topic)
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 25000)
    .load()
)
# <do some processing and save the data>
def save_batch(batch_df, batch_id):
    pass
query = df.writeStream.foreachBatch(save_batch).start(
    outputMode="append",
    checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
)
while query.isActive:
    progress = query.lastProgress
    # A batch well below maxOffsetsPerTrigger suggests the backlog is drained
    if progress and progress["numInputRows"] < 25000 * 0.9:
        timeout = sum(progress["durationMs"].values())
        timeout = min(5 * 60 * 1000, max(15000, timeout))
        spark.conf.set("spark.sql.streaming.stopTimeout", str(timeout))
        query.stop()
        break
    time.sleep(10)
Spark Version: 2.4.5
Scala Version: 2.1.1
Update: With Spark 3.3, .trigger(availableNow=True) is an option that works well with .option("maxOffsetsPerTrigger", 25000).
I would recommend .trigger(once=True) and .awaitTermination() (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers).
Warning: This will not work with .option("maxOffsetsPerTrigger", 25000). If maxOffsetsPerTrigger is not set, it defaults to pulling all offsets since the last run and creates one large micro-batch.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", os.environ["KAFKA_SERVERS"]) \
    .option("subscribe", config.kafka.topic) \
    .option("startingOffsets", "earliest") \
    .load()
def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    pass
df \
    .writeStream \
    .foreachBatch(foreach_batch_function) \
    .trigger(once=True) \
    .start(
        outputMode="append",
        checkpointLocation=os.path.join(checkpoint_path, config.kafka.topic),
    ) \
    .awaitTermination()
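For the availableNow variant mentioned in the update, a minimal sketch might look like this, assuming PySpark 3.3 or later. run_available_now is a hypothetical helper name; the streaming DataFrame passed in may keep its maxOffsetsPerTrigger option.

```python
def run_available_now(df, batch_fn, checkpoint_location):
    """Process everything available at start time in bounded micro-batches, then stop."""
    query = (
        df.writeStream
        .foreachBatch(batch_fn)
        .option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)  # requires Spark 3.3+
        .start()
    )
    # Returns once the data available when the query started has been processed.
    query.awaitTermination()
    return query


# Usage (assumes df, foreach_batch_function and checkpoint_path as above):
# run_available_now(df, foreach_batch_function,
#                   os.path.join(checkpoint_path, config.kafka.topic))
```

Because the query ends on its own, there is no need to call stop() or tune spark.sql.streaming.stopTimeout.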

Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

At this line of code:
if not df.head(1).isEmpty:
I get the exception:
Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
I do not know how to use if with streaming data.
When I execute the code line by line in Jupyter, it works and I get my result, but running it as a .py file fails.
My purpose is this: I want to use streaming to get data from Kafka every second, then transform each batch of streaming data (one batch being the data received in one second) into a pandas DataFrame, apply pandas functions to it, and finally send the result to another Kafka topic.
Please help me, and forgive my poor English. Thanks a lot.
import json
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local[2]", "OdometryConsumer")
spark = SparkSession(sparkContext=sc) \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "data") \
    .load()
ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print(type(ds))
if not df.head(1).isEmpty:
    alertQuery = ds \
        .writeStream \
        .queryName("qalerts") \
        .format("memory") \
        .start()
    alerts = spark.sql("select * from qalerts")
    pdAlerts = alerts.toPandas()
    a = pdAlerts['value'].tolist()
    d = []
    for i in a:
        x = json.loads(i)
        d.append(x)
    df = pd.DataFrame(d)
    print(df)
    ds = df['jobID'].unique().tolist()
    dics = {}
    for source in ds:
        ids = df.loc[df['jobID'] == source, 'id'].tolist()
        dics[source]=ids
    print(dics)
query = ds \
    .writeStream \
    .queryName("tableName") \
    .format("console") \
    .start()
query.awaitTermination()
Remove if not df.head(1).isEmpty: and you should be fine.
The reason for the exception is simple: a streaming query is a structured query that never ends and is executed continually. It is not possible to look at a single element, since there is no "single element" but (possibly) thousands of elements, and it would be hard to tell when exactly you'd like to look under the covers and see just one.
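For the per-batch pandas processing the question describes, one option that may fit is foreachBatch, where each micro-batch arrives as an ordinary DataFrame that can be converted with toPandas(). The sketch below is only an illustration: group_ids_by_job and process_batch are hypothetical names, and the jobID and id fields are taken from the question's code.

```python
import json
from collections import defaultdict


def group_ids_by_job(json_values):
    """Group the 'id' of each JSON record under its 'jobID'."""
    grouped = defaultdict(list)
    for raw in json_values:
        record = json.loads(raw)
        grouped[record["jobID"]].append(record["id"])
    return dict(grouped)


def process_batch(batch_df, batch_id):
    """Runs once per micro-batch; batch_df is a regular (non-streaming) DataFrame."""
    values = (
        batch_df.selectExpr("CAST(value AS STRING) AS value")
        .toPandas()["value"]
        .tolist()
    )
    print(group_ids_by_job(values))


# Usage (assumes df is the streaming Kafka DataFrame from the question):
# query = (df.writeStream
#          .foreachBatch(process_batch)
#          .trigger(processingTime="1 second")
#          .start())
# query.awaitTermination()
```

Sending the grouped result on to another Kafka topic would happen inside process_batch, for example with a batch write to the kafka format.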

Resources