Multiple writeStream with Spark Streaming

I am working with Spark Structured Streaming and I am facing some issues trying to implement multiple writeStreams.
Below is my code:
DataWriter.writeStreamer(firstTableData,"parquet",CheckPointConf.firstCheckPoint,OutputConf.firstDataOutput)
DataWriter.writeStreamer(secondTableData,"parquet",CheckPointConf.secondCheckPoint,OutputConf.secondDataOutput)
DataWriter.writeStreamer(thirdTableData,"parquet", CheckPointConf.thirdCheckPoint,OutputConf.thirdDataOutput)
where writeStreamer is defined as follows:
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String) = {
  val query = input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()

  query.awaitTermination()
}
The problem I am facing is that only the first table is written with writeStream; nothing happens for the other tables.
Do you have any idea about this, please?

query.awaitTermination() should only be called after the last stream has been created.
The writeStreamer function can be modified to return a StreamingQuery and not call awaitTermination at that point (as it is blocking):
def writeStreamer(input: DataFrame, checkPointFolder: String, output: String): StreamingQuery = {
  input
    .writeStream
    .format("orc")
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
}
then you will have:
val query1 = DataWriter.writeStreamer(...)
val query2 = DataWriter.writeStreamer(...)
val query3 = DataWriter.writeStreamer(...)
query3.awaitTermination()

If you want the writers to run in parallel you can use
sparkSession.streams.awaitAnyTermination()
and remove query.awaitTermination() from the writeStreamer method, as sketched below.
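A minimal sketch of that variant, reusing the DataFrames and the checkpoint/output configuration objects from the question (assumed to be in scope), with writeStreamer returning the StreamingQuery as shown above:

// Start the three sinks without blocking on any of them.
val query1 = DataWriter.writeStreamer(firstTableData, CheckPointConf.firstCheckPoint, OutputConf.firstDataOutput)
val query2 = DataWriter.writeStreamer(secondTableData, CheckPointConf.secondCheckPoint, OutputConf.secondDataOutput)
val query3 = DataWriter.writeStreamer(thirdTableData, CheckPointConf.thirdCheckPoint, OutputConf.thirdDataOutput)

// Block until any of the active queries on this SparkSession terminates (e.g. on failure).
sparkSession.streams.awaitAnyTermination()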

By default the number of concurrent jobs is 1, which means that only one job will be active at a time.
Did you try increasing the number of possible concurrent jobs in the Spark conf?
sparkConf.set("spark.streaming.concurrentJobs", "3")
Not an official source: http://why-not-learn-something.blogspot.com/2016/06/spark-streaming-performance-tuning-on.html

Related

Spark Structured Streaming not able to see the record details

I am trying to process the records from readStream and just print the rows.
However, in my driver logs or executor logs I can't see any printed statements.
What might be wrong?
For every record (or, ideally, batch) I want to print the message, and for every batch I want to execute a process.
val kafka = spark.readStream
  .format("kafka")
  .option("maxOffsetsPerTrigger", MAX_OFFSETS_PER_TRIGGER)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("subscribe", topic) // comma separated list of topics
  .option("startingOffsets", "earliest")
  .option("checkpointLocation", CHECKPOINT_LOCATION)
  .option("failOnDataLoss", "false")
  .option("minPartitions", sys.env.getOrElse("MIN_PARTITIONS", "64").toInt)
  .load()

import spark.implicits._

println("JSON output to write into sink")

val consoleOutput = kafka.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value")
  //.select(from_json($"json", schema) as "data")
  //.select("data.*")
  //.select(get_json_object(($"value").cast("string"), "$").alias("body"))
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(row: Row): Unit = {
      logger.info(s"Record received in data frame is -> " + row.mkString)
      runProcess() // Want to run some process every microbatch
    }
    override def close(errorOrNull: Throwable): Unit = {}
  })
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

consoleOutput.awaitTermination()
}
I copied your code and it runs fine without the runProcess function call.
If you are planning to do two different things, I recommend having two separate queries after selecting the relevant fields from the Kafka topic:
val kafkaSelection = kafka.selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value")
1. For every record (or, ideally, batch) I want to print the message:
val query1 = kafkaSelection
  .writeStream
  .outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .option("checkpointLocation", CHECKPOINT_LOCATION1)
  .start()
2. For every batch I want to execute a process:
val query2 = kafkaSelection
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(row: Row): Unit = {
      logger.info(s"Record received in data frame is -> " + row.mkString)
      runProcess() // Want to run some process every microbatch
    }
    override def close(errorOrNull: Throwable): Unit = {}
  })
  .outputMode("append")
  .option("checkpointLocation", CHECKPOINT_LOCATION2)
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
Also note that I have set the checkpoint location for each query individually, which ensures consistent tracking of the Kafka offsets. Make sure to use two different checkpoint locations, one per query. You can run both queries in parallel.
It is important to define both queries before waiting for their termination:
query1.awaitTermination()
query2.awaitTermination()
Tested with Spark 2.4.5.
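As a side note (a sketch only, not part of the answer above): if runProcess really only needs to run once per micro-batch rather than once per record, foreachBatch (available since Spark 2.4) is usually a closer fit than ForeachWriter. Assuming kafkaSelection, logger, runProcess, and CHECKPOINT_LOCATION2 are defined as above:

val query2 = kafkaSelection
  .writeStream
  .outputMode("append")
  .option("checkpointLocation", CHECKPOINT_LOCATION2)
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (batchDF: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Runs once per micro-batch on the driver; batchDF.count() forces evaluation of the batch.
    logger.info(s"Batch $batchId contains ${batchDF.count()} records")
    runProcess() // executed once per micro-batch, not once per record
  }
  .start()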

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming application which has to read from 12 Kafka topics (different schemas, Avro format) at once, deserialize the data and store it in HDFS. When I read from a single topic using my code, it works fine and without errors, but when running multiple queries together I'm getting the following error:
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
  // A List (not a tuple) so that foreach iterates over the topics
  val topic_list = List("topic1", "topic2", "topic3", "topic4")

  topic_list.foreach(x => {
    kafkaProps.update("subscribe", x)

    val source = Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
    val schemaParser = new Schema.Parser
    val schema = schemaParser.parse(source)
    val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]

    val kafkaStreamData = spark
      .readStream
      .format("kafka")
      .options(kafkaProps)
      .load()

    val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))

    val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
      .withColumn("rows", udfDeserialize(col("value")))
      .select("rows.*")

    val query = transformedDeserializedData
      .writeStream
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .format("parquet")
      .option("path", "/output/topics/" + x)
      .option("checkpointLocation", checkpointLocation + "//" + x)
      .start()
  })

  spark.streams.awaitAnyTermination()
}
Alternative: you can use Kafka Connect (from Confluent), NiFi, StreamSets etc., as your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools installed. The small files problem you state is not an issue, so be it.
From Apache Kafka 0.9 onwards you can use the Kafka Connect API for a KAFKA --> HDFS sink (various HDFS formats are supported). You need a Kafka Connect cluster, but that is based on your existing cluster in any event, so it is not a big deal. Someone does need to maintain it, though.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs

How to include both "latest" and "JSON with specific Offset" in "startingOffsets" while importing data from Kafka into Spark Structured Streaming

I have a streaming query saving data into a file sink. I am using .option("startingOffsets", "latest") and a checkpoint location. If there is any downtime on Spark and the streaming query starts again, I do not want it to start processing only from where the query left off when it went down; in that scenario I would also like to add ("startingOffsets", """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """), specifying the user-defined offsets that it needs to process from.
I tried doing this with different programs, but I need to achieve this in one single program.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object OSB_offset_kafkaToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("OSB_kafkaToSpark")
      .config("spark.mongodb.output.uri", "spark.mongodb.output.uri=mongodb://somemongodb.com:27018")
      .getOrCreate()

    println("SparkSession -> " + spark)

    import spark.implicits._

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "somekafkabroker:9092, somekafkabroker:9092")
      .option("subscribe", "someTopic")
      .option("startingOffsets", "latest")
      .option("startingOffsets", """ {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""")
      .option("endingOffsets", """ {"someTopic":{"0":-1}}, "someTopic":{"1":-1}}, "someTopic":{"2":-1}} """)
      .option("failOnDataLoss", "false")
      .load()

    val dfs = df.selectExpr("CAST(value AS STRING)")

    val data = dfs.withColumn("splitted", split($"value", "/"))
      .select($"splitted".getItem(4).alias("region"), $"splitted".getItem(5).alias("service"), col("value"))
      .withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
      .withColumn("region_type", concat(
        when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
        when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
      .withColumn("datetime", regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0))

    val extractedDF = data.filter(
        col("region").isNotNull &&
        col("service").isNotNull &&
        col("value").isNotNull &&
        col("service_type").isNotNull &&
        col("region_type").isNotNull &&
        col("datetime").isNotNull)
      .filter("region != ''")
      .filter("service != ''")
      .filter("value != ''")
      .filter("service_type != ''")
      .filter("region_type != ''")
      .filter("datetime != ''")

    val pathstring = "/user/spark_streaming".concat(args(0))

    val query = extractedDF.writeStream
      .format("json")
      .option("path", pathstring)
      .option("checkpointLocation", "/user/some_checkpoint")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}
I need to run a single program with both .option("startingOffsets", "latest") and .option("startingOffsets", """ {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""").
I am not sure if this is achievable.
This is an old question at this point, so the OP likely got their answer, but when specifying offsets in JSON string format, you can use -2 for earliest and -1 for latest.
From the Structured Streaming + Kafka Integration Guide (description of the startingOffsets option):
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
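For illustration only (topic name, broker, and the first offset are taken from the question; the rest is an assumption about the intended layout), the partitions can be combined into one JSON object, mixing an explicit offset with -2 (earliest) and -1 (latest):

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somekafkabroker:9092")
  .option("subscribe", "someTopic")
  // One JSON object: partition 0 from an explicit offset, partition 1 from earliest, partition 2 from latest.
  .option("startingOffsets", """{"someTopic":{"0":438521,"1":-2,"2":-1}}""")
  .option("failOnDataLoss", "false")
  .load()

As the documentation excerpt above notes, with a checkpoint location this only applies when the query is started for the first time; a restarted query always resumes from the checkpointed offsets.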

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in the Kafka topic trial is like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark streaming code:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()

  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column Type
key binary
value binary
topic string
partition int
offset long
timestamp long
timestampType int
Which gives exactly seven columns. If you want to write only the payload (value), select it and cast it to a string:
spark.readStream
  ...
  .load()
  .selectExpr("CAST(value as string)")
  .writeStream
  ...
  .awaitTermination()
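Applied to the code from the question, a minimal sketch (reusing the broker, topic, and paths given above) could look like this:

// Only the Kafka value, cast to string, is selected, so the text sink sees a single column.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.95.20:9092")
  .option("subscribe", "trial")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("text")
  .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()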

Count number of records written to Hive table in Spark Structured Streaming

I have this code:
val query = event_stream
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .select(from_json($"value", schema_simple).as("data"))
  .select("data.*")
  .writeStream
  .outputMode("append")
  .format("orc")
  .option("path", "hdfs:***********")
  //.option("path", "/tmp/orc")
  .option("checkpointLocation", "hdfs:**********/")
  .start()

println("###############" + query.isActive)
query.awaitTermination()
I want to count the number of records inserted into Hive.
What options are available, and how do I do it?
I found the SparkListener TaskEnd event, but I'm not sure if it works for a streaming source; I tried it and it is not working as of now.
One approach I thought of was to make a hiveReader and then count the number of records in the stream.
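One possible approach, sketched here and not taken from the original thread: Structured Streaming publishes per-batch metrics through StreamingQueryListener, so the rows flowing through each micro-batch can be tracked from the query progress events. numInputRows counts the rows read in each batch, which for an append-only pipeline without filters is what ends up in the sink:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Logs the number of input rows processed by every micro-batch of every query on this session.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"Batch ${event.progress.batchId}: ${event.progress.numInputRows} rows processed")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})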
