Spark Structured Streaming not able to see the record details - apache-spark

I am trying to process the records from a readStream and simply print each row.
However, I can't see any printed statements in my driver logs or executor logs.
What might be wrong?
For every record (or, ideally, every batch) I want to print the message.
For every batch, I want to execute a process.
val kafka = spark.readStream
  .format("kafka")
  .option("maxOffsetsPerTrigger", MAX_OFFSETS_PER_TRIGGER)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("subscribe", topic) // comma separated list of topics
  .option("startingOffsets", "earliest")
  .option("checkpointLocation", CHECKPOINT_LOCATION)
  .option("failOnDataLoss", "false")
  .option("minPartitions", sys.env.getOrElse("MIN_PARTITIONS", "64").toInt)
  .load()
import spark.implicits._
println("JSON output to write into sink")
val consoleOutput = kafka.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value")
  //.select(from_json($"json", schema) as "data")
  //.select("data.*")
  //.select(get_json_object(($"value").cast("string"), "$").alias("body"))
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(row: Row): Unit = {
      logger.info(
        s"Record received in data frame is -> " + row.mkString )
      runProcess() // Want to run some process every microbatch
    }
    override def close(errorOrNull: Throwable): Unit = {}
  })
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
consoleOutput.awaitTermination()
}

I copied your code and it is running fine without the runProcess function call.
If you are planning to do two different things, I recommend having two separate queries after selecting the relevant fields from the Kafka topic:
val kafkaSelection = kafka.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value")
1. For every record (or, ideally, every batch) I want to print the message:
val query1 = kafkaSelection
  .writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .option("checkpointLocation", CHECKPOINT_LOCATION1)
  .start()
2. For every batch, I want to execute a process:
val query2 = kafkaSelection
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(row: Row): Unit = {
      logger.info(
        s"Record received in data frame is -> " + row.mkString )
      runProcess() // Want to run some process every microbatch
    }
    override def close(errorOrNull: Throwable): Unit = {}
  })
  .outputMode("append")
  .option("checkpointLocation", CHECKPOINT_LOCATION2)
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
Also note that I have set the checkpoint location for each query individually, which ensures consistent tracking of the Kafka offsets. Make sure to use a different checkpoint location for each query. You can run both queries in parallel.
It is important to define both queries before waiting for their termination:
query1.awaitTermination()
query2.awaitTermination()
Tested with Spark 2.4.5.
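Two further points may be worth checking. First, in the question's code the later .format("console") call appears to replace the foreach sink, since both calls set the sink of the same writer, so the ForeachWriter would never be invoked. Second, ForeachWriter.process runs once per row on the executors, not once per micro-batch. If the goal is to run something once per micro-batch, foreachBatch (available since Spark 2.4) may be a better fit. The following is an untested sketch, not part of the answer above: the broker address, topic name, checkpoint path and the body of runProcess are assumptions made only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object PerBatchProcess {
  // Hypothetical stand-in for the question's runProcess(); it runs on the driver.
  def runProcess(batchId: Long, rowCount: Long): Unit =
    println(s"Batch $batchId finished with $rowCount rows")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PerBatchProcess").getOrCreate()

    val kafkaSelection = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "someTopic")                     // assumed topic name
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value")

    // Declaring the function type explicitly selects the Scala overload of foreachBatch.
    val handleBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
      batch.persist()                     // the batch is used twice below
      batch.show(truncate = false)        // printed to the driver's stdout
      runProcess(batchId, batch.count())  // called once per micro-batch
      batch.unpersist()
    }

    val query = kafkaSelection.writeStream
      .outputMode("append")
      .foreachBatch(handleBatch)
      .option("checkpointLocation", "/tmp/checkpoints/per-batch") // assumed path
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

With this shape, a single query covers both requirements, and the output of show and runProcess ends up in the driver log rather than in the executor logs.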

Related

What is the best way to perform multiple filter operations on spark streaming dataframe read from Kafka?

I need to apply multiple filters to a DataFrame read from a Kafka topic and publish the output of each filter to an external system (such as another Kafka topic).
I read kafkaDF like this:
val kafkaDF: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .load()
  .select(col("topic"), expr("cast(value as string) as message"))
  .filter(col("message").isNotNull && col("message") =!= "")
  .select(from_json(col("message"), eventsSchema).as("eventData"))
  .select("eventData.*")
I am able to run foreachBatch on this DataFrame and then iterate over the list of filters to get the filtered data, which can then be published to a Kafka topic, as shown below:
kafkaDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // List of filters that needs to be applied
    filterList.par.foreach(filterString => {
      val filteredDF = batch.filter(filterString)
      // Add some columns.
      // Do some operations based on different filter
      filteredDF.toJSON.foreach(value => {
        // Publish a message to Kafka
      })
    })
  }
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()
But I am not sure if this is the best way, given so many iterations. Is there a better way than doing it like this?
If you plan to write data from one Kafka topic into multiple Kafka topics, you can create a column called "topic" in a single DataFrame when writing to Kafka. The value in this column then defines the topic to which a record will be produced. This allows you to write to as many different Kafka topics as required.
Therefore, I would just apply your filter logic as a when/otherwise condition or, if it is more complex, as a UDF.
Below is some example code that should get you started. Based on the value of the consumed Kafka message, a column called "topic" gets created in filteredDf. If value = 1, the record gets produced into the topic called "out1"; otherwise the record gets produced into the topic called "out2".
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "partition", "offset", "timestamp")
// The condition described above (value = 1); replace it with your own filter logic
val filter = col("value") === 1
val filteredDf = inputDf.withColumn("topic", when(filter, lit("out1")).otherwise(lit("out2")))
val query = filteredDf
  .select(
    col("key"),
    to_json(struct(col("*"))).alias("value"),
    col("topic"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/home/michael/sparkCheckpoint/1/")
  .start()
query.awaitTermination()
EDIT: (I might have misunderstood your question initially)
If you just want to find a good way to apply multiple filters out of your filterList you can combine them using foldLeft:
val filter1 = col("value") === 1
val filter2 = col("key") === 1
val filterList = List(filter1, filter2)
val filterAll = filterList.tail.foldLeft(filterList.head)((f1, f2) => f1.and(f2))
println(filterAll)
((value = 1) AND (key = 1))
Then apply .filter(filterAll) to your Dataframe.
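The question's filterList seems to hold SQL strings (each one is passed to batch.filter(filterString)), whereas the foldLeft above works on Column objects. For the string case, a minimal sketch in plain Scala could look like the following. It assumes every entry is a valid Spark SQL boolean expression; FilterStrings and combineAll are hypothetical names introduced only for this illustration.

```scala
object FilterStrings {
  // Combine SQL predicates with AND. Each predicate is wrapped in parentheses
  // so that an OR inside one predicate cannot change the meaning of the whole.
  def combineAll(filters: Seq[String]): String =
    if (filters.isEmpty) "true" // no filters: keep every row
    else filters.map(f => s"($f)").mkString(" AND ")
}
```

The resulting string can be passed directly to the DataFrame, for example batch.filter(FilterStrings.combineAll(filterList)).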

How to include both "latest" and "JSON with specific Offset" in "startingOffsets" while importing data from Kafka into Spark Structured Streaming

I have a streaming query that saves data to a file sink. I am using .option("startingOffsets", "latest") and a checkpoint location. If Spark has any downtime, then when the streaming query starts again I do not want it to resume from where it left off. Instead, I would like to add ("startingOffsets", """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """) to specify user-defined offsets to start processing from.
I tried doing this with separate programs, but I need to achieve it in a single program.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
object OSB_offset_kafkaToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.
      builder().
      appName("OSB_kafkaToSpark").
      config("spark.mongodb.output.uri", "spark.mongodb.output.uri=mongodb://somemongodb.com:27018").
      getOrCreate()
    println("SparkSession -> "+spark)
    import spark.implicits._
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "somekafkabroker:9092, somekafkabroker:9092")
      .option("subscribe", "someTopic")
      .option("startingOffsets", "latest")
      .option("startingOffsets",""" {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""")
      .option("endingOffsets",""" {"someTopic":{"0":-1}}, "someTopic":{"1":-1}}, "someTopic":{"2":-1}} """)
      .option("failOnDataLoss", "false")
      .load()
    val dfs = df.selectExpr("CAST(value AS STRING)")
    val data = dfs.withColumn("splitted", split($"value", "/"))
      .select($"splitted".getItem(4).alias("region"), $"splitted".getItem(5).alias("service"), col("value"))
      .withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
      .withColumn("region_type", concat(
        when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
        when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
      .withColumn("datetime", regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0))
    val extractedDF = data.filter(
      col("region").isNotNull &&
        col("service").isNotNull &&
        col("value").isNotNull &&
        col("service_type").isNotNull &&
        col("region_type").isNotNull &&
        col("datetime").isNotNull)
      .filter("region != ''")
      .filter("service != ''")
      .filter("value != ''")
      .filter("service_type != ''")
      .filter("region_type != ''")
      .filter("datetime != ''")
    val pathstring = "/user/spark_streaming".concat(args(0))
    val query = extractedDF.writeStream
      .format("json")
      .option("path", pathstring)
      .option("checkpointLocation", "/user/some_checkpoint")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
    query.awaitTermination()
  }
}
I need to run a single program with both .option("startingOffsets", "latest") and .option("startingOffsets",""" {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""").
I am not sure if this is achievable.
This is an old question at this point, so the OP likely got their answer, but when specifying offsets in JSON string format, you can use -2 for earliest and -1 for latest.
Source (Spark documentation for the startingOffsets option):
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
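Note that the JSON in the question does not look well formed: each topic should appear once, with all of its partitions in a single object, as in the topicA/topicB example. Setting startingOffsets twice also does not combine the two values; as far as I know the later call simply overrides the earlier one, and endingOffsets is documented for batch queries only. A minimal sketch for building a valid string in plain Scala follows. StartingOffsets and toJson are hypothetical names introduced for illustration, and no escaping is done, so topic names are assumed to contain no quotes or backslashes.

```scala
object StartingOffsets {
  val Earliest: Long = -2L // marker for "earliest", per the quoted documentation
  val Latest: Long = -1L   // marker for "latest", per the quoted documentation

  // offsets maps topic -> (partition -> offset).
  // Topics and partitions are sorted only to make the output deterministic.
  def toJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val body = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + body + "}"
    }.mkString("{", ",", "}")
}
```

For example, StartingOffsets.toJson(Map("someTopic" -> Map(0 -> 438521L, 1 -> 438705L, 2 -> StartingOffsets.Latest))) mixes explicit offsets with "latest" for partition 2, and the result can be passed to .option("startingOffsets", ...). As the quoted documentation says, this only takes effect when a new query is started; a query restarted with an existing checkpoint resumes from the checkpoint, so applying new offsets would presumably require a fresh checkpoint location.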

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming it to HDFS.
The data stored in the Kafka topic trial looks like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark streaming code:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()
  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets" , "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column         Type
key            binary
value          binary
topic          string
partition      int
offset         long
timestamp      long
timestampType  int
That gives exactly seven columns. If you want to write only the payload (value), select it and cast it to a string:
spark.readStream
  ...
  .load()
  .selectExpr("CAST(value as string)")
  .writeStream
  ...
  .awaitTermination()

multiple writeStream with spark streaming

I am working with Spark streaming and I am facing some issues trying to implement multiple writeStreams.
Below is my code:
DataWriter.writeStreamer(firstTableData,"parquet",CheckPointConf.firstCheckPoint,OutputConf.firstDataOutput)
DataWriter.writeStreamer(secondTableData,"parquet",CheckPointConf.secondCheckPoint,OutputConf.secondDataOutput)
DataWriter.writeStreamer(thirdTableData,"parquet", CheckPointConf.thirdCheckPoint,OutputConf.thirdDataOutput)
where writeStreamer is defined as follows:
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String) = {
  val query = input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
  query.awaitTermination()
}
The problem I am facing is that only the first table is written by Spark writeStream; nothing happens for the other tables.
Do you have any idea what is going on?
query.awaitTermination() should be called only after the last stream has been created.
The writeStreamer function can be modified to return a StreamingQuery instead of calling awaitTermination at that point (the call is blocking):
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String): StreamingQuery = {
  input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
}
Then you will have:
val query1 = DataWriter.writeStreamer(...)
val query2 = DataWriter.writeStreamer(...)
val query3 = DataWriter.writeStreamer(...)
query3.awaitTermination()
If you want the writers to run in parallel, you can use
sparkSession.streams.awaitAnyTermination()
and remove query.awaitTermination() from the writeStreamer method.
By default the number of concurrent jobs is 1, which means that only one job is active at a time.
Did you try increasing the number of concurrent jobs in the Spark conf?
sparkConf.set("spark.streaming.concurrentJobs","3")
Not an official source: http://why-not-learn-something.blogspot.com/2016/06/spark-streaming-performance-tuning-on.html

Spark Streaming with Kafka ensure loss less processing

I have a very simple Spark + Kafka application. I'm reading from Kafka and printing to the console. The code below has two alternative lines, marked Good-Line and Bad-Line.
Initially I run with the good line, then I switch to the bad line for a while. When I change back to the good line, I expect processing to resume from where it left off. Surprisingly, it starts from the latest offset.
1
2
3
missing
missing
7
8
9
In the code below, how can I ensure that I read all the messages? I did not find any place in the code where I can control the offset. I'm fine even with duplicate processing, because each message will have a unique ID.
public static void main(String[] args) throws Exception {
    String brokers = "quickstart:9092";
    String topics = "simple_topic_1";
    String master = "local[*]";
    SparkSession sparkSession = SparkSession
            .builder().appName(SimpleKafkaProcessor.class.getName())
            .master(master).getOrCreate();
    SQLContext sqlContext = sparkSession.sqlContext();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");
    Dataset<Row> rawDataSet = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", brokers)
            //.option("enable.auto.commit", "false")
            .option("auto.offset.reset", "earliest")
            .option("group.id", "safe_message_landing_app_2")
            .option("subscribe", topics).load();
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("basicView");
    // Good-Line
    sqlContext.sql("select string(Value) as StrValue from basicView").writeStream()
    // Bad-Line
    //sqlContext.sql("select fieldNotFound as StrValue from basicView").writeStream()
            .format("console")
            .option("checkpointLocation", "cp/" + UUID.randomUUID().toString())
            .trigger(ProcessingTime.create("15 seconds"))
            .start()
            .awaitTermination();
}
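No answer is recorded for this question here. One likely cause, offered as an assumption rather than a confirmed diagnosis: the checkpoint path contains UUID.randomUUID(), so every run starts a brand-new query, and a new streaming query on the Kafka source starts from "latest" unless startingOffsets says otherwise. As far as I know, the Structured Streaming Kafka source tracks offsets through the checkpoint and does not honor consumer settings such as auto.offset.reset or group.id passed this way. The following untested sketch is written in Scala, although the question uses Java; the fixed checkpoint path "cp/simple_topic_1" is an assumed value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SimpleKafkaProcessor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SimpleKafkaProcessor")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "quickstart:9092")
      .option("subscribe", "simple_topic_1")
      // Used only the first time the query runs; later runs resume from the checkpoint.
      .option("startingOffsets", "earliest")
      .load()

    val query = raw.selectExpr("CAST(value AS STRING) AS StrValue")
      .writeStream
      .format("console")
      // Fixed path (assumed value) so that a restart can find the previous offsets.
      .option("checkpointLocation", "cp/simple_topic_1")
      .trigger(Trigger.ProcessingTime("15 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

With a stable checkpoint, a run that fails on the bad line should leave the stored offsets untouched, so switching back to the good line would pick up the messages that arrived in between. Some batches may be reprocessed after a failure, which the question says is acceptable.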
