Queries with streaming sources must be executed with writeStream.start() - apache-spark

I have a Structured Streaming DataFrame tempDataFrame2 with a column Field1. I am trying to calculate the approxQuantile of Field1. However, whenever I run the following line, I get the error message below:
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
Queries with streaming sources must be executed with writeStream.start()
Below is the code snippet:
val tempDataFrame2: DataFrame = ??? // a Structured Streaming DataFrame
// Calculate IQR
val Array(Q1, Q3) = tempDataFrame2.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0)
// Filter messages
val tempDataFrame3 = tempDataFrame2.filter("Some working filter")
val query = tempDataFrame2.writeStream.outputMode("append").queryName("table").format("console").start()
query.awaitTermination()
I have already gone through these two links from SO: Link1 Link2. Unfortunately, I am not able to relate those answers to my problem.
Edit
After reading the comments, this is how I am planning to proceed:
1) Read all the uncommitted offsets from the Kafka topic.
2) Save them to a DataFrame variable.
3) Stop the structured streaming query so that I don't read from the Kafka topic anymore.
4) Start processing the saved DataFrame from step 2).
But now I am not sure how to go ahead: how do I know that there are no more records to consume in the Kafka topic, so that I can stop the streaming query?
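One possible way to get both the quantiles and the "stop when the topic is drained" behaviour is sketched below. It is not taken from the thread: it assumes Spark 2.4 or later (for foreachBatch), that Field1 is numeric, and it uses a 1.5 * IQR fence as a stand-in for the unspecified "Some working filter". The names IqrPerBatch, filterByIqr and run are hypothetical. Inside foreachBatch the micro-batch is an ordinary (non-streaming) DataFrame, so stat.approxQuantile is allowed, and Trigger.Once() processes the data available when the query starts and then stops the query.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object IqrPerBatch {
  // Called once per micro-batch; `batch` is a plain batch DataFrame here,
  // so batch-only operations such as approxQuantile are permitted.
  def filterByIqr(batch: DataFrame, batchId: Long): Unit = {
    batch.stat.approxQuantile("Field1", Array(0.25, 0.75), 0.0) match {
      case Array(q1, q3) =>
        val iqr = q3 - q1
        // Assumption: keep rows within 1.5 * IQR of the quartiles.
        batch
          .filter(col("Field1").between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))
          .show(truncate = false)
      case _ =>
        // No quantiles were returned (for example, an empty batch): nothing to do.
        ()
    }
  }

  // Trigger.Once() runs a single batch over the available data and then stops.
  def run(streamingDf: DataFrame): StreamingQuery =
    streamingDf.writeStream
      .trigger(Trigger.Once())
      .foreachBatch(filterByIqr _)
      .start()
}
```

Calling IqrPerBatch.run(tempDataFrame2).awaitTermination() would then return once that single batch has been processed. Note that the quantiles are computed per micro-batch, and what counts as "available" for a Kafka source depends on the starting offsets and the checkpoint.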

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from Kafka and should use these values as parameters to process data from a Cassandra database. My question is: how do I use the data in the input DataFrame (Kafka) as "where" parameters in a Cassandra "select" without getting the error below?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
  .readStream
  .format("kafka")
  .options(
    Map("kafka.bootstrap.servers" -> kafka_bootstrap,
      "subscribe" -> kafka_topic,
      "startingOffsets" -> "latest",
      "fetchOffset.numRetries" -> "5",
      "kafka.group.id" -> groupId
    ))
  .load()
I get this error whenever I try to store the DataFrame values in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With it, Spark gives the error I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
  val messageDS: Dataset[Message] = messageToProcess.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  val mensagem = messageToProcess
  mensagem
}
When you need to use data from Kafka to query data in Cassandra, the operation is a typical join between two datasets: you don't need to call .collect to find entries, you just do the join. Enriching data from Kafka with data from an external dataset is a very common pattern, and Cassandra provides low-latency operations for it.
Your code could look like the following (you'll need to configure the so-called DirectJoin, see link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._
val df = spark.readStream.format("kafka")
  .options(Map(...)).load()
// ... decode data in Kafka into columns
val cassdata = spark.read.cassandraFormat("table", "keyspace").load
val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... // process joined data
val query = processed.writeStream.....output data somewhere...start()
query.awaitTermination()
I have a detailed blog post on how to perform efficient joins with data in Cassandra.
As the error message suggests, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes, see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
  // Now microBatchDf is no longer a streaming dataframe
  // you can check with microBatchDf.isStreaming
  val messageDS: Dataset[Message] = microBatchDf.as[Message]
  val listData: Array[Message] = messageDS.collect()
  listData.foreach(x => println(x.country))
  // ...
}.start() // the query only runs once start() is called

Kafka delete (tombstone) not updating max aggregate in Spark Structured Streaming

I am prototyping aggregations in a Spark Structured Streaming (Spark 3.0) job and publishing the updates to Kafka. I need to calculate the all-time max date and max percentage (no windowing) for each group. The code seems fine except when the source stream contains Kafka tombstone records (deletes). The stream receives a Kafka record with a valid key and a null value, but the max aggregate continues to include the deleted record in the calculation. What are the best options to have this recalculated without the deleted records when a delete is consumed from Kafka?
Example
Messages produced:
<"user1|1", {"user": "user1", "pct":30, "timestamp":"2021-01-01 01:00:00"}>
<"user1|2", {"user": "user1", "pct":40, "timestamp":"2021-01-01 02:00:00"}>
<"user1|2", null>
Spark code snippet:
val usageStreamRaw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", usageTopic)
  .load()
val usageStream = usageStreamRaw
  .select(col("key").cast(StringType).as("key"),
    from_json(col("value").cast(StringType), valueSchema).as("json"))
  .selectExpr("key", "json.*")
val usageAgg = usageStream.groupBy("user")
  .agg(
    max("timestamp").as("maxTime"),
    max("pct").as("maxPct")
  )
val sq = usageAgg.writeStream.outputMode("update").option("truncate","false").format("console").start()
sq.awaitTermination()
For user1 the result in column maxPct is 40, but it should be 30 after the deletion. Is there a good way to do this with Spark Structured Streaming?
You could make use of the Kafka timestamp in each message (aliased here so that it does not clash with the timestamp field inside the JSON value):
val usageStream = usageStreamRaw
  .select(col("key").cast(StringType).as("key"),
    from_json(col("value").cast(StringType), valueSchema).as("json"),
    col("timestamp").as("kafkaTimestamp"))
  .selectExpr("key", "json.*", "kafkaTimestamp")
Then
1) select only the latest value for each key, and
2) filter out null values
before applying your aggregation on the maximum time and pct.
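The two steps above can be illustrated with plain Scala collections. This is only a sketch of the logic, not Spark code and not the answerer's implementation: the case classes Usage and Record, the object TombstoneAwareMax and the numeric kafkaTimestamp values are all made up for the example, and a tombstone is modelled as a None value. How to express the same thing inside a streaming query (for example with a stateful operator) is left open here.

```scala
// Plain Scala collections, not Spark: this only illustrates the logic.
final case class Usage(user: String, pct: Int, timestamp: String)

// One Kafka record; value = None models a tombstone (null value).
final case class Record(key: String, value: Option[Usage], kafkaTimestamp: Long)

object TombstoneAwareMax {
  // Returns user -> (maxTime, maxPct), ignoring keys whose latest record is a tombstone.
  def maxPerUser(records: Seq[Record]): Map[String, (String, Int)] = {
    // 1) keep only the latest record for each key
    val latestPerKey = records.groupBy(_.key).values.map(_.maxBy(_.kafkaTimestamp))
    // 2) drop tombstones (records whose value is None)
    val live = latestPerKey.flatMap(_.value)
    // 3) aggregate per user; "yyyy-MM-dd HH:mm:ss" strings sort chronologically
    live.groupBy(_.user).map { case (user, rows) =>
      (user, (rows.map(_.timestamp).max, rows.map(_.pct).max))
    }
  }
}
```

With the three example messages from the question (in production order), the record for key user1|2 is superseded by its tombstone, so only the pct 30 record contributes to the result for user1.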

Multiple operations/aggregations on the same Dataframe/Dataset in Spark Structured Streaming

I use Spark 2.3.2.
I'm receiving data from Kafka. I must do multiple aggregations on the same data. Then all aggregations results will go to the same database (columns or tables may be changed). For example:
val kafkaSource = spark.readStream.format("kafka") ...
val agg1 = kafkaSource.groupBy().agg ...
val agg2 = kafkaSource.groupBy().mapGroupsWithState() ...
val agg3 = kafkaSource.groupBy().mapGroupsWithState() ...
But when I call writeStream for each aggregation result:
agg1.writeStream.foreach(...).start()
agg2.writeStream.foreach(...).start()
agg3.writeStream.foreach(...).start()
Spark receives data independently in each writeStream. Is this approach efficient?
Can I do multiple aggregations with one writeStream? If so, is that approach efficient?
Every writeStream operation results in a new streaming query. Every streaming query will read from the source and execute the entire query plan. Unlike DStream, there is no cache/persist option available.
In Spark 2.4, a new API, foreachBatch, has been introduced to handle this kind of scenario more efficiently.
Inside foreachBatch, caching can be used to avoid multiple reads:
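kafkaSource.writeStream.foreachBatch { (df: DataFrame, id: Long) =>
  df.persist()
  val agg1 = df.groupBy().agg ...
  val agg2 = df.groupBy().mapgroupswithstate() ...
  val agg3 = df.groupBy().mapgroupswithstate() ...
  // write agg1, agg2 and agg3 to the sink here, before unpersisting
  df.unpersist()
  () // foreachBatch expects a function returning Unit
}.start()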
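A more complete sketch of this pattern follows, under several assumptions that are not in the thread: Spark 2.4 or later (the question uses 2.3.2, where foreachBatch is not available), a hypothetical broker address and topic name, hypothetical output paths, and two simple batch aggregations standing in for the elided ones. Passing a named method (writeAggregates _) is one common way to avoid overload ambiguity on foreachBatch under Scala 2.12.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

object MultiAggPerBatch {
  // Runs two independent aggregations over one cached micro-batch,
  // so the Kafka data for the batch is read only once.
  def writeAggregates(batch: DataFrame, batchId: Long): Unit = {
    batch.persist()
    try {
      val countsByKey = batch.groupBy("key").count()
      val maxOffset = batch.agg(max("offset").as("maxOffset"))
      // Hypothetical output locations; any batch sink could be used instead.
      countsByKey.write.mode("append").parquet("/tmp/agg/counts_by_key")
      maxOffset.write.mode("append").parquet("/tmp/agg/max_offset")
    } finally {
      batch.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MultiAggPerBatch").getOrCreate()
    val kafkaSource = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed address
      .option("subscribe", "events")                       // assumed topic
      .load()
    val query = kafkaSource.writeStream
      .foreachBatch(writeAggregates _)
      .start()
    query.awaitTermination()
  }
}
```

One caveat worth keeping in mind: aggregations computed inside foreachBatch cover only the rows of the current micro-batch. They are not running totals over the whole stream, so any cross-batch state has to be merged in the target database or handled by a stateful streaming operator.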

How to do a fast insertion of the data in a Kafka topic inside a Hive Table?

I have a Kafka topic in which I have received around 500k events.
Currently, I need to insert those events into a Hive table.
Since events are time-driven, I decided to use the following strategy:
1) Define a path inside HDFS, which I call users. Inside this path there will be several Parquet files, each one corresponding to a certain date, e.g. 20180412, 20180413, 20180414, etc. (format YYYYMMDD).
2) Create a Hive table and use the date in YYYYMMDD format as the partition. The idea is to use each of the files inside the users HDFS directory as a partition of the table, by simply adding the corresponding Parquet file through the commands:
ALTER TABLE users DROP IF EXISTS PARTITION
(fecha='20180412') ;
ALTER TABLE users ADD PARTITION
(fecha='20180412') LOCATION '/users/20180412';
3) Read the data from the Kafka topic by iterating from the earliest event, get the date value in the event (the dateCliente field) and, given that date value, insert the event into the corresponding Parquet file.
4) To accomplish point 3, I read each event and save it to a temporary HDFS file, which I then read with Spark and convert into a DataFrame.
5) Using Spark, I insert the DataFrame values into the Parquet file.
The code follows this approach:
val conf = ConfigFactory.parseResources("properties.conf")
val brokersip = conf.getString("enrichment.brokers.value")
val topics_in = conf.getString("enrichment.topics_in.value")
val spark = SparkSession
.builder()
.master("yarn")
.appName("ParaTiUserXY")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val properties = new Properties
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("bootstrap.servers", brokersip)
properties.put("auto.offset.reset", "earliest")
properties.put("group.id", "UserXYZ2")
// Schema used to parse the values of the Kafka topic as JSON
val my_schema = new StructType()
.add("longitudCliente", StringType)
.add("latitudCliente", StringType)
.add("dni", StringType)
.add("alias", StringType)
.add("segmentoCliente", StringType)
.add("timestampCliente", StringType)
.add("dateCliente", StringType)
.add("timeCliente", StringType)
.add("tokenCliente", StringType)
.add("telefonoCliente", StringType)
val consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe( util.Collections.singletonList("geoevents") )
val fs = {
  val conf = new Configuration()
  FileSystem.get(conf)
}
val temp_path:Path = new Path("hdfs:///tmp/tmpstgtopics")
if( fs.exists(temp_path)){
  fs.delete(temp_path, true)
}
while(true)
{
  val records=consumer.poll(100)
  for (record<-records.asScala){
    val data = record.value.toString
    val dataos: FSDataOutputStream = fs.create(temp_path)
    val bw: BufferedWriter = new BufferedWriter( new OutputStreamWriter(dataos, "UTF-8"))
    bw.append(data)
    bw.close
    val data_schema = spark.read.schema(my_schema).json("hdfs:///tmp/tmpstgtopics")
    val fechaCliente = data_schema.select("dateCliente").first.getString(0)
    if( fechaCliente < date){
      data_schema.select("longitudCliente", "latitudCliente","dni", "alias",
        "segmentoCliente", "timestampCliente", "dateCliente", "timeCliente",
        "tokenCliente", "telefonoCliente").coalesce(1).write.mode(SaveMode.Append)
        .parquet("/desa/landing/parati/xyusers/" + fechaCliente)
    }
    else{
      break
    }
  }
}
consumer.close()
However, this method takes around 1 second to process each record on my cluster. At that rate, it would take around 6 days to process all the events I have.
Is this the optimal way to insert all the events in a Kafka topic into a Hive table?
What other alternatives exist, or what improvements could I make to my code to speed it up?
Other than the fact that you're not using Spark Streaming correctly to poll from Kafka (you wrote a vanilla Scala Kafka consumer with a while loop) and coalesce(1) will always be a bottleneck as it forces one executor to collect the records, I'll just say you're really reinventing the wheel here.
What other alternatives exist
The ones that I know of, all of which are open source:
Gobblin (replaces Camus) by LinkedIn
Kafka Connect w/ HDFS Sink Connector (built into Confluent Platform, but also builds from source on Github)
Streamsets
Apache NiFi
Secor by Pinterest
From those listed, it would be beneficial for you to have JSON or Avro encoded Kafka messages, and not a flat string. That way, you can drop the files as is into a Hive serde, and not parse them while consuming them. If you cannot edit the producer code, make a separate Kafka Streams job taking the raw string data, parse it, then write to a new topic of Avro or JSON.
If you choose Avro (which you really should for Hive support), you can use the Confluent Schema Registry. Or if you're running Hortonworks, they offer a similar Registry.
Hive on Avro operates far better than on text or JSON. Avro can easily be transformed into Parquet, and I believe each of the above options offers at least Parquet support, while some can also write ORC (Kafka Connect doesn't do ORC at this time).
Each of the above also supports some level of automatic Hive partition generation based on the Kafka record time.
You can improve the parallelism by increasing the number of partitions of the Kafka topic and having one or more consumer groups with multiple consumers, each consuming one-to-one from a partition.
As cricket_007 mentioned, you can use one of the open-source frameworks, or you can have more consumer groups consuming the same topic to offload the data.
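If staying with Spark is preferred, a Structured Streaming job can replace the hand-written consumer loop and the per-record temporary file. The following is a minimal sketch rather than a drop-in replacement: the broker address and checkpoint location are assumptions, only a subset of the question's schema is shown, and it assumes a Spark version with the Kafka source and Trigger.Once() (2.2 or later).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

object KafkaToPartitionedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaToPartitionedParquet").getOrCreate()

    // Subset of the schema from the question; the other fields are added the same way.
    val schema = new StructType()
      .add("dni", StringType)
      .add("alias", StringType)
      .add("dateCliente", StringType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed address
      .option("subscribe", "geoevents")
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast(StringType), schema).as("event"))
      .select("event.*")

    // One output directory per date; Trigger.Once() drains what is available and stops.
    val query = events.writeStream
      .format("parquet")
      .partitionBy("dateCliente")
      .option("path", "/desa/landing/parati/xyusers")
      .option("checkpointLocation", "/tmp/checkpoints/xyusers") // assumed location
      .trigger(Trigger.Once())
      .start()
    query.awaitTermination()
  }
}
```

Note that partitionBy writes directories named dateCliente=20180412 rather than 20180412, so the LOCATION in the ALTER TABLE ... ADD PARTITION statements would need to point at those directories.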

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic to which a producer was posting messages. I verified with the console consumer that the messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
The following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}
private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is "+z)
  //edit
  lines.foreachRDD(rdd => rdd.map(println)
  // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
Edit
I have also used
lines.foreachRDD(rdd => rdd.map(println))
in the actual code, but that does not work either. I set the retention period as mentioned in the post "Kafka spark directStream can not get data", but the problem persists.
Your process is a continuation of a DStream pipeline with no output operator that gets the pipeline executed every batch interval.
You can "see" it by reading the signature of count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a dstream of Kafka records that you transform into a dstream of single values (each being the result of count). That alone is not enough to have anything output (to a console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so that execution can start. Spark Streaming's DStream by design has no notion of being an output dstream. It is DStreamGraph that knows about, and can differentiate between, input and output dstreams.
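Applied to the process method from the question, this could look like the sketch below (the object name ProcessLines is made up for the example). print() and foreachRDD are output operations, so they get the pipeline executed every batch interval. Note also that rdd.map(println) in the question is a lazy RDD transformation with no action behind it, which is why it prints nothing even inside foreachRDD.

```scala
import org.apache.spark.streaming.dstream.DStream

object ProcessLines {
  def process(lines: DStream[String]): Unit = {
    // count() returns a DStream[Long]; print() is the output operation
    // that makes Spark compute and display it for every batch.
    lines.count().print()

    // foreachRDD is also an output operation. Inside it, count() and take()
    // are RDD actions, so they actually run (unlike rdd.map(println)).
    lines.foreachRDD { rdd =>
      println(s"count of lines is ${rdd.count()}")
      rdd.take(10).foreach(println) // prints up to 10 records on the driver
    }
  }
}
```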

Resources