I'm strangling Kafka spark streaming with dynamic schema.
I"m consuming from Kafka (KafkaUtils.createDirectStream) each message /JSON field can be nested, each field can appear in some messages and sometimes not.
The only thing I found is to do:
Spark 2.0 implicit encoder, deal with missing column when type is Option[Seq[String]] (scala)
case class MyTyp(column1: Option[Any], column2: Option[Any]....)
This will cover,im not sure, fields that may appear, and nested Fileds.
Any approval/other Ideas/general help will be appreciated ...
After long integration and trails, two ways to solve non schema Kafka consuming: 1) Throw "editing/validation" each message with "lambda" function .not my favorite. 2) Spark: on each micro batch obtain flatten schema and intersect needed columns. use spark SQL to query the frame for needed data. That worked for me.
Related
I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above ? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one in a streaming query (I'm repeating it myself to remember better as I seem to have been caught multiple times while with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not to share data for which Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(cast('value).as("string")) ...
val df1 = strDf.filter(...) # in "parallel"
val df2 = strDf.filter(...) # in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
I was wondering if anyone knew what the difference between the two syntax is? I know both are used to read data from Kafka but what differentiates them?
spark.readStream.format("kafka")
KafkaUtils.createDirectStream(__)
They are part of different dependencies, for one.
The first is for Structured Streaming, and returns Dataframes, and is considered the preferred API for Spark
The second is for RDD Spark Streaming operations where the data might not have any consistency to it (a structure), or if you did want more direct access to the lower level ConsumerRecord object of Spark
After the batch, Spark ETL I need to write to Kafka topic the resulting DataFrame that contains multiple different columns.
According to the following Spark documentation https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html the Dataframe being written to Kafka should have the following mandatory column in schema:
value (required) string or binary
As I mentioned previously, I have much more columns with values so I have a question - how to properly send the whole DataFrame row as a single message to Kafka topic from my Spark application? Do I need to join all of the values from all columns into the new DataFrame with a single value column(that will contain the joined value) or there is more proper way to achieve it?
The proper way to do that is already hinted by the docs, and doesn't really differ form what you'd do with any Kafka client - you have to serialize the payload before sending to Kafka.
How you you'll do that (to_json, to_csv, Apache Avro) depends on your business requirements - nobody can answers this but you (or your team).
I have a structured streaming query which sinks to Kafka. This query has a complex aggregation logic.
I would like to sink the output DF of this query to multiple Kafka topics each partitioned on a different ‘key’ column. I don't want to have multiple Kafka sinks for each of the different Kafka topics because that would mean running multiple streaming queries - one for each Kafka topic, especially since my aggregation logic is complex.
Questions:
Is there a way to output the results of a structured streaming query to multiple Kafka topics each with a different key column but without having to execute multiple streaming queries?
If not, would it be efficient to cascade the multiple queries such that the first query does the complex aggregation and writes output to Kafka and then the other queries just read the output of the first query and write their topics to Kafka thus avoiding doing the complex aggregation again?
Thanks in advance for any help.
So the answer was kind of staring at me in the eye. It's documented as well. Link below.
One can write to multiple Kafka topics from a single query. If your dataframe that you want to write has a column named "topic" (along with "key", and "value" columns), it will write the contents of a row to the topic in that row. This automatically works. So the only thing you need to figure out is how to generate the value of that column.
This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
I am also looking for solution of this problem and in my case its not necessarily kafka sink. I want to write some records of a dataframe in sink1 while some other records in sink2 (depending upon some condition, without reading the same data twice in 2 streaming queries).
Currently it does not seem possible as per current implementation ( createSink() method in DataSource.scala provides support for a single sink).
However, In Spark 2.4.0 there is a new api coming: foreachBatch() which will give handle to a dataframe microbatch which can be used to cache the dataframe, write to different sinks or processing multiple times before uncaching aagin.
Something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.cache()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.uncache()
}
right now this feature available in databricks runtime :
https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
EDIT 15/Nov/18 :
It is available now in Spark 2.4.0 ( https://issues.apache.org/jira/browse/SPARK-24565)
There is no way to have a single read and multiple writes in structured streaming out of the box. The only way is to implement custom sink that will write into multiple topics.
Whenever you call dataset.writeStream().start() spark starts a new stream that reads from a source (readStream()) and writes into a sink (writeStream()).
Even if you try to cascade it spark will create two separate streams with one source and one sink each. In other words, it will read, process and write data twice:
Dataset df = <aggregation>;
StreamingQuery sq1 = df.writeStream()...start();
StreamingQuery sq2 = df.writeStream()...start();
There is a way to cache read data in spark streaming but this option is not available for structured streaming yet.
I need to store the values from kafka->spark streaming->cassandra.
Now, I am receiving the values from kafka->spark and I have a spark job to save values into the cassandra db. However, I'm facing a problem with the datatype dstream.
In this following snippet you can see how I'm trying to convert the DStream into python friendly list object so that I can work with it but it gives an error.
input at kafka producer:
Byrne 24 San Diego robbyrne#email.com Rob
spark-job:
map1={'spark-kafka':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
lines = kafkaStream.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint() # outputs-> Byrne 24 SanDiego robbyrne#email.com Rob
list=[lambda word for word in words]
#gives an error -> TypeError: 'TransformedDStream' object is not iterable
This is how I'm saving values from spark->cassandra
rdd2=sc.parallelize([{
... "lastname":'Byrne',
... "age":24,
... "city":"SanDiego",
... "email":"robbyrne#email.com",
... "firstname":"Rob"}])
rdd2.saveToCassandra("keyspace2","users")
What's the best way of converting the DStream object to a dictionary or what's the best way of doing what I'm trying to do here?
I just need the values received from kafka (in the form of DStream) to be saved in Cassandra.
Thanks and any help would be nice!
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
Like everything 'sparky', I think a short explanation is due since even if you are familiar with RDDs, DStreams are of an even higher concept:
A Discretized Stream (DStream), is a continuous sequence of RDDs of the same type, representing a continuous stream of data. In your case, DStreams are created from live Kafka data.
While a Spark Streaming program is running, each DStream periodically generates a RDD from live Kafka data
Now, to iterate over received RDDs, you need to use DStream#foreachRDD (and as implied by its name, it serves a similar purpose as foreach, but this time, to iterate over RDDs).
Once you have an RDD, you can invoke rdd.collect() or rdd.take() or any other standard API for RDDs.
Now, as a closing note, to make things even more fun, Spark introduced a new receiver-less “direct” approach to ensure stronger end-to-end guarantees.
(KafkaUtils.createDirectStream which requires Spark 1.3+)
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka.
(which is a nice way to say you will have to "mess" with the offsets yourself)
See Direct Streams Approach for further details.
See here for a scala code example
According to the official doc of the spark-cassandra connector: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
import com.datastax.spark.connector.streaming._
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ...
val wc = stream
.map(...)
.filter(...)
.saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
ssc.start()
Actually, I found the answer in this tutorial http://katychuang.me/blog/2015-09-30-kafka_spark.html.