When I do the below, it works fine:
company_info_df.select(col("value"))
.writeStream()
.outputMode("append")
.option("truncate", false)
.format("console")
.trigger(Trigger.ProcessingTime("4 seconds"))
.start();
But when I do it as below, i.e. with .format("memory"), it does not show anything:
company_info_df.select(col("value"))
.writeStream()
.outputMode("append")
.option("truncate", false)
.format("memory")
.queryName("company_info")
.option("checkpointLocation", checkpointDir + "\\console")
.trigger(Trigger.ProcessingTime("4 seconds"))
.start();
Dataset<Row> company_inf = sparkSession.sql("select * from company_info");
company_inf.show();
What am I doing wrong here? What is the correct way to do this?
Refer to the below code in spark-shell, which works for sample data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val userSchema = new StructType().add("col1", "string").add("col2", "string").add("col3", "string").add("col4", "string").add("col5", "string").add("col6", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/Temp") // reads the stream from CSV source files in the folder
csvDF.createOrReplaceTempView("abcd");
val dbDf2 = spark.sql("select col2, sum(col6) from abcd group by col2");
dbDf2.writeStream.queryName("abcdquery").outputMode("complete").format("memory").start()
In your code, try removing some of the options from the write operation and see what is going wrong.
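To actually see the rows, here is a minimal sketch (continuing the spark-shell example above, and assuming the only issue is querying the table before the first micro-batch has run): wait for the query to process the available data, then select from the in-memory table that the memory sink registers under the queryName:
// The memory sink registers an in-memory table named after queryName ("abcdquery" here).
val query = dbDf2.writeStream.queryName("abcdquery").outputMode("complete").format("memory").start()
// Block until all data currently available at the source has been processed;
// selecting immediately after start() can return an empty result.
query.processAllAvailable()
spark.sql("select * from abcdquery").show(false)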
What's the best way to read, transform and write streams with different structures?
I have an application with two read streams and write streams from two different Kafka topics. The two streams have different structures and follow different transformation processes. When I write the individual DataFrames to the console, I see only one stream in the console output instead of two:
val df1 = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribe", "topic-event-1")
.load()
val df2 = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribe", "topic-event-2")
.load()
eventOneStream(df1)
eventTwoStream(df2)
def eventOneStream(dataframe: DataFrame): Unit = {
// do some transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination()
}
def eventTwoStream(dataframe: DataFrame): Unit = {
// do some other type of transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination()
}
You see only one stream because the first query blocks the second one when you call awaitTermination(). What you can do is start both queries and then use StreamingQueryManager.awaitAnyTermination():
val query1 = eventOneStream(df1)
val query2 = eventTwoStream(df2)
spark.streams.awaitAnyTermination()
def eventOneStream(dataframe: DataFrame): StreamingQuery = {
// do some transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start()
}
def eventTwoStream(dataframe: DataFrame): StreamingQuery = {
// do some other type of transformations
//then write streams to console
dataframe
.writeStream
.outputMode("append")
.format("console")
.start()
}
Using Spark 2.4.0
Confluent Schema Registry to receive the schema
The message key is serialized as a String and the value in Avro, so I am trying to deserialize just the value using io.confluent.kafka.serializers.KafkaAvroDeserializer, but it isn't working. Can anyone review my code to see what's wrong?
libraries imported:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Deserializer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{ Encoder, SparkSession}
Code Body
val topics = "test_topic"
val spark: SparkSession = SparkSession.builder
.config("spark.streaming.stopGracefullyOnShutdown", "true")
.config("spark.streaming.backpressure.enabled", "true")
.config("spark.streaming.kafka.maxRatePerPartition", 2170)
.config("spark.streaming.kafka.maxRetries", 1)
.config("spark.streaming.kafka.consumer.poll.ms", "600000")
.appName("SparkStructuredStreamAvro")
.config("spark.sql.streaming.checkpointLocation", "/tmp/new_checkpoint/")
.enableHiveSupport()
.getOrCreate
//add settings for schema registry url, used to get deser
val schemaRegUrl = "http://xx.xx.xx.xxx:xxxx"
val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)
//subscribe to kafka
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx.xx.xxxx")
.option("subscribe", "test.topic")
.option("kafka.startingOffsets", "latest")
.option("group.id", "use_a_separate_group_id_for_each_stream")
.load()
//add confluent kafka avro deserializer, needed to read messages appropriately
val deser = new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]]
//needed to convert column select into Array[Bytes]
import spark.implicits._
val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
//read the raw bytes from spark and then use the confluent deserializer to get the record back
val decoded = deser.deserialize(topics, rawBytes)
val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
recordId
}
results.writeStream
.outputMode("append")
.format("text")
.option("path", "/tmp/path_new/")
.option("truncate", "false")
.start()
.awaitTermination()
spark.stop()
It fails to deserialize, and the error received is:
Caused by: java.io.NotSerializableException: io.confluent.kafka.serializers.KafkaAvroDeserializer
Serialization stack:
- object not serializable (class: io.confluent.kafka.serializers.KafkaAvroDeserializer, value: io.confluent.kafka.serializers.KafkaAvroDeserializer#591024db)
- field (class: ca.bell.wireless.ingest$$anonfun$1, name: deser$1, type: interface org.apache.kafka.common.serialization.Deserializer)
- object (class ca.bell.wireless.ingest$$anonfun$1, <function1>)
- element of array (index: 1)
It works perfectly fine when I write a normal Kafka consumer (not through Spark) using:
props.put("key.deserializer", classOf[StringDeserializer])
props.put("value.deserializer", classOf[KafkaAvroDeserializer])
You defined the variable ('deser') for the KafkaAvroDeserializer outside the map block; that is what causes the exception.
Try changing the code like this:
val brdDeser = spark.sparkContext.broadcast(new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]])
val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
val deser = brdDeser.value
val decoded = deser.deserialize(topics, rawBytes)
val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
recordId
}
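As an alternative sketch (a common pattern, not part of the answer above): instead of broadcasting, you can create the deserializer inside mapPartitions, so the non-serializable object is instantiated on each executor and is never captured by the closure. This reuses the schemaRegUrl, topics and imports already defined in the question:
val results = df.select(col("value").as[Array[Byte]]).mapPartitions { iter =>
  // Build the non-serializable client and deserializer on the executor, once per partition
  val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)
  val deser = new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]]
  iter.map { rawBytes =>
    val decoded = deser.deserialize(topics, rawBytes)
    decoded.get("nameId").toString // Avro Utf8 rendered as a plain String
  }
}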
We are trying to write a Spark streaming application that writes to HDFS. However, whenever we write the files, lots of duplicates show up. This behavior occurs with or without crashing the application via kill, and with both the DStream and Structured APIs. The source is a Kafka topic. The behavior of the checkpoint directory seems very random, and I have not come across very relevant information on the issue.
The question is: can the checkpoint directory provide exactly-once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version: 2.11-1.1.0
zookeeper version: 3.4.8-1
HDP: 3.1
Any help is appreciated.
Thanks,
Gautam
object sparkStructuredDownloading {
val kafka_brokers="kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"
def main(args: Array[String]): Unit = {
var topic = args(0).trim().toString()
new downloadingAnalysis(kafka_brokers ,topic).process()
}
}
class downloadingAnalysis(brokers: String,topic: String) {
def process(): Unit = {
// try{
val spark = SparkSession.builder()
.appName("sparkStructuredDownloading")
// .appName("kafka_duplicate")
.getOrCreate()
spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
println("Application Started")
import spark.implicits._
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.streaming.Trigger
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "latest")
//.option("kafka.group.id", "testduplicate")
.load()
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)") //Converting binary to text
println("READ STREAM INITIATED")
val filteredDF= personJsonDf.filter(line=> new ParseLogs().validateLogLine(line.get(0).toString()))
spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
new ParseLogs().parseLogLine(logLine) // assumed parsing call; the original UDF body was cut off in the paste
})
val df1 = filteredDF.selectExpr("parseLogLine(value) as result")
println(df1.schema)
println("WRITE STREAM INITIATED")
val checkpoint_loc="/warehouse/test_duplicate/download/checkpoint1"
val kafkaOutput = df1.writeStream
.outputMode("append")
.format("orc")
.option("path", "/warehouse/test_duplicate/download/data1")
.option("checkpointLocation", checkpoint_loc) // assuming the checkpoint directory defined above is applied to the query
.option("maxRecordsPerFile", 10)
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
.awaitTermination()
}
}
I have this code:
import org.apache.spark.sql.SparkSession
object TopicIngester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[*]") // remove this later
.appName("Ingester")
.getOrCreate()
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "sandbox-hdp.hortonworks.com:6667" /*my.cluster.com:9092 in case of cloudera*/)
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
.write
.parquet("/user/maria_dev/test")
spark.stop()
}
}
When I run it in the Hortonworks sandbox, everything works fine: all available data is read from the test topic and saved into the /user/maria_dev/test folder.
I also have a topic with the same name on my Cloudera cluster, but for some reason it gets stuck at .parquet("/path/to/folder") and never finishes, as if it were waiting for more data forever.
What could be the problem?
I am trying to test Spark Structured Streaming ... and failing ... How can I test it properly?
I followed the general Spark testing question from here, and my closest try [1] looked something like this:
import simpleSparkTest.SparkSessionTestWrapper
import org.scalatest.FunSpec
import org.apache.spark.sql.types.{StringType, IntegerType, DoubleType, StructType, DateType}
import org.apache.spark.sql.streaming.OutputMode
class StructuredStreamingSpec extends FunSpec with SparkSessionTestWrapper {
describe("Structured Streaming") {
it("Read file from system") {
val schema = new StructType()
.add("station_id", IntegerType)
.add("name", StringType)
.add("lat", DoubleType)
.add("long", DoubleType)
.add("dockcount", IntegerType)
.add("landmark", StringType)
.add("installation", DateType)
val sourceDF = spark.readStream
.option("header", "true")
.schema(schema)
.csv("/Spark-The-Definitive-Guide/data/bike-data/201508_station_data.csv")
.coalesce(1)
val countSource = sourceDF.count()
val query = sourceDF.writeStream
.format("memory")
.queryName("Output")
.outputMode(OutputMode.Append())
.start()
.processAllAvailable()
assert(countSource === 70)
}
}
}
Sadly, it always fails with org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start().
I also found this issue in the spark-testing-base repo and wonder whether it is even possible to test Spark Structured Streaming.
I want to have integration tests and maybe even use Kafka on top to test checkpointing or specific corrupt-data scenarios. Can someone help me out?
Last but not least, I figured the version may also be a constraint: I currently develop against 2.1.0, which I need because of the Azure HDInsight deployment options. Self-hosted is an option if this is the drag.
Did you solve this?
You are doing a count() on a streaming DataFrame before starting the execution by calling start().
If you want a count, how about doing this?
sourceDF.writeStream
.format("memory")
.queryName("Output")
.outputMode(OutputMode.Append())
.start()
.processAllAvailable()
val results: java.util.List[Row] = spark.sql("select * from Output").collectAsList()
assert(results.size() === 70)
You can also use the StructuredStreamingBase trait from @holdenk's spark-testing-base library:
https://github.com/holdenk/spark-testing-base/blob/936c34b6d5530eb664e7a9f447ed640542398d7e/core/src/test/2.2/scala/com/holdenkarau/spark/testing/StructuredStreamingSampleTests.scala
Here's an example of how to use it:
class StructuredStreamingTests extends FunSuite with SharedSparkContext with StructuredStreamingBase {
override implicit def reuseContextIfPossible: Boolean = true
test("add 3") {
import spark.implicits._
val input = List(List(1), List(2, 3))
val expected = List(4, 5, 6)
def compute(input: Dataset[Int]): Dataset[Int] = {
input.map(elem => elem + 3)
}
testSimpleStreamEndState(spark, input, expected, "append", compute)
}}