Here is my data:
+---+-----------+-----+
|key|animal_type|value|
+---+-----------+-----+
|123|        cat|meows|
|456|        dog|barks|
+---+-----------+-----+
I am currently writing to Event Hubs from Databricks like so:
(df.select("key","value").writeStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_server)
.option("topic","cats").option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.sasl.jaas.config", connection_string)
.option("kafka.request.timeout.ms", "3600000")
.option("checkpointLocation", checkpoint_path)
.start())
My challenge is that the dataframe contains records about both cats and dogs, so I don't want to hardcode the topic as 'cats'; sometimes it should be 'dogs'. Instead, I would like a way for Event Hubs to dynamically assign the topic based on the value in the animal_type column.
Is this possible? Or do I need to have a single dataframe/write-config per topic?
You need to alias the animal_type column as topic before writing, and then remove the topic option.
Refer to the Spark Structured Streaming documentation for other ways to structure the dataframe when writing to the Kafka API.
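For example, a minimal Scala sketch of that change (the question uses PySpark, but the column aliasing and option names are the same; bootstrap_server, connection_string and checkpoint_path are assumed to be defined as in the question):

// Route each row by a "topic" column instead of a hard-coded topic option.
val routed = df.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "animal_type AS topic")    // the Kafka sink reads the target topic from this column

routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_server)
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", connection_string)
  .option("kafka.request.timeout.ms", "3600000")
  .option("checkpointLocation", checkpoint_path)
  .start()                   // note: no .option("topic", ...) here

When the topic option is not set, the Kafka sink takes the target topic for each row from the topic column, so rows with animal_type = 'dogs' end up in the dogs topic.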
I'm porting a streaming job (Kafka topic -> AWS S3 Parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
.withColumn("month", functions.date_format(col("createdAt"), "MM"))
.withColumn("day", functions.date_format(col("createdAt"), "dd"))
.writeStream()
.trigger(Trigger.ProcessingTime("15 seconds"))
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "/some/checkpoint/directory/")
.option("path", "/some/directory/")
.option("truncate", "false")
.partitionBy("year", "month", "day")
.start()
.awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
Question:
Is there a way to customize the output directory name? For backward compatibility reasons, I need it to be
/s3-bucket/some/directory/2021/01/02/
No, there is no way to customize the output directory names to that format from within your Spark Structured Streaming application.
Partition directories are named after the columns they are partitioned by; without the column names in the path, it would be ambiguous which column each value belongs to. You would need to write a separate application that renames those directories into the desired format.
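If it helps, here is a rough Scala sketch of such a separate clean-up step using the Hadoop FileSystem API. The bucket and the year=/month=/day= layout come from the question; the class name, the s3a scheme and the rename strategy are assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object PartitionDirRenamer {
  def main(args: Array[String]): Unit = {
    val root = new Path("s3a://s3-bucket/some/directory")   // assumed base path
    val fs   = root.getFileSystem(new Configuration())

    // Find the Hive-style partition directories written by the streaming job ...
    val dirs: Array[FileStatus] =
      Option(fs.globStatus(new Path(root, "year=*/month=*/day=*"))).getOrElse(Array.empty)

    for (status <- dirs) {
      // ... and strip the "column=" prefixes: year=2021/month=01/day=02 -> 2021/01/02
      val relative = status.getPath.toUri.getPath
        .stripPrefix(root.toUri.getPath)
        .stripPrefix("/")
        .split("/")
        .map(_.split("=").last)
        .mkString("/")

      val target = new Path(root, relative)
      fs.mkdirs(target.getParent)
      fs.rename(status.getPath, target)
    }
  }
}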
I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both a Kafka topic and Delta Lake.
I thought of using foreachBatch, but it looks like it doesn't support multiple streaming sinks.
I also tried using the Spark session's awaitAnyTermination() with multiple write streams, but the second stream is not getting processed!
Is there a way to achieve this?
This is my code:
I am reading from a Kafka stream and creating a single streaming dataframe.
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "ingestionTopic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
Writing the above dataframe to a Kafka topic:
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9082")
.option("topic", "outputTopic1")
.start()
Writing the same streaming dataframe to Delta Lake:
val ds2 = df.format("delta")
.outputMode("append")
.option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
.start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination
There are a few things you need to do to write one input stream to multiple output streams:
You need two different checkpointLocations, one for each output stream.
You also need to make sure the writeStream call is present on your second output query.
Finally, it is important to start both queries before waiting for the termination of both queries (you are already doing this).
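Applied to the code in the question, a sketch of the two corrected queries could look like this (the Kafka query's checkpoint path is an assumption; everything else is taken from the question):

// Query 1: Kafka sink, now with its own checkpoint location (path is an assumption)
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9082")
  .option("topic", "outputTopic1")
  .option("checkpointLocation", "/test/kafka/_checkpoints/etlflow")
  .start()

// Query 2: Delta sink, going through writeStream and using a different checkpoint location
val ds2 = df.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
  .start("/test/delta/events")

// Both queries are started before blocking on their termination
ds1.awaitTermination()
ds2.awaitTermination()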
Given that the data from both topics is joined at one point and finally sent to a Kafka sink, which is the best way to read from multiple topics?
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t1,t2")
vs
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t1")
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t2")
Somewhere I will do df1.join(df2) and send the result to the Kafka sink.
With respect to performance and resource usage, which would be the better option here?
Thanks in advance
PS: I see another similar question, Spark structured streaming app reading from multiple Kafka topics, but there the dataframes from the two topics don't seem to be used together.
In the first approach, you'd have to add a filter at some point and then proceed with the join. Unless you also want to process both streams together, the second approach is a bit more performant and easier to maintain.
I'd say approach 2 is the straightforward one: it skips a filter stage, hence it is a little more efficient. It also gives the two streams autonomy from an infrastructure point of view, for example if one of the topics were to move to a new Kafka cluster. You also don't have to account for unevenness between the two topics, for example in the number of partitions, which makes job tuning easier.
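To make the trade-off concrete, here is a sketch of the extra filter the first approach implies. The topic column is provided by the Kafka source itself; spark, servers and the t1/t2 names come from the question:

import org.apache.spark.sql.functions.col

// Approach 1: a single subscription, split afterwards by the Kafka-provided "topic" column
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", "t1,t2")
  .load()

val df1 = df.filter(col("topic") === "t1")   // extra filter stage before the join
val df2 = df.filter(col("topic") === "t2")

// Approach 2 gives you df1 and df2 directly from the two separate readStream calls.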
I want to load all records from a Kafka topic using Spark, but all the examples I have seen use Spark Streaming. How can I load the messages from Kafka exactly once?
The exact steps are listed in the official documentation, for example:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
However "all records" is rather poorly defined if the source is continuous stream, as the result depends on the point in time, when query is executed.
Additionally you should keep in mind that parallelism is limited by the partitions of the Kafka topic, so you have to be careful not to overwhelm the cluster.
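As a quick usage sketch, the resulting batch DataFrame can then be decoded and written out like any other; the output path here is an assumption:

// Kafka delivers binary key/value columns, so cast them before use
df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .write
  .mode("overwrite")
  .parquet("/some/output/path")   // assumed destination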
Currently, I have the following df
+-------+--------------------+-----+
|    key|          created_at|count|
+-------+--------------------+-----+
|Bullish|[2017-08-06 08:00...|   12|
|Bearish|[2017-08-06 08:00...|    1|
+-------+--------------------+-----+
I use the following to stream the data to Kafka
df.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092").option("topic","chart3").option("checkpointLocation", "/tmp/checkpoints2")
.outputMode("complete")
.start()
The problem here is that each row in the DataFrame is written to Kafka one by one, so my consumer will get the messages one by one.
Is there any way to consolidate all the rows into an array and stream that to Kafka, so that my consumer can get the whole data in one go?
Thanks for the advice.
"My consumer will get the message one by one."
Not exactly. It depends on the Kafka producer properties. You can specify your own properties and use, for example:
props.put("batch.size", 16384);
In the background, Spark uses a normal cached KafkaProducer. It will use the properties that you provide in the options when submitting the query.
See also the KafkaProducer Javadoc. Be aware that it may not scale correctly.
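In a Structured Streaming job those producer properties are passed as writer options with a kafka. prefix. A sketch based on the query in the question (kafka.linger.ms is an addition here, used to give the producer time to fill batches):

df.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "chart3")
  .option("kafka.batch.size", "16384")   // producer-side batch size in bytes
  .option("kafka.linger.ms", "50")       // assumption: wait up to 50 ms to fill a batch
  .option("checkpointLocation", "/tmp/checkpoints2")
  .outputMode("complete")
  .start()

Note that this only batches records on the wire; each row is still an individual Kafka record, so the consumer still sees one message per row.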