Multiple Spark Structured Streaming readStream & writeStream - apache-spark

What's the best way to read, transform, and write streams with different structures?
I have an application with two readStreams and writeStreams from two different Kafka topics. The two streams have different structures and go through different transformations. When writing the individual DataFrames to the console, I see only one stream in the output instead of two:
val df1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBroker)
  .option("subscribe", "topic-event-1")
  .load()

val df2 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBroker)
  .option("subscribe", "topic-event-2")
  .load()

eventOneStream(df1)
eventTwoStream(df2)

def eventOneStream(dataframe: DataFrame): Unit = {
  // do some transformations
  // then write the stream to the console
  dataframe
    .writeStream
    .outputMode("append")
    .format("console")
    .start()
    .awaitTermination()
}

def eventTwoStream(dataframe: DataFrame): Unit = {
  // do some other type of transformations
  // then write the stream to the console
  dataframe
    .writeStream
    .outputMode("append")
    .format("console")
    .start()
    .awaitTermination()
}

You see only one stream because the first query blocks the second one when you call awaitTermination(). What you can do is start both queries and then use StreamingQueryManager.awaitAnyTermination():
val query1 = eventOneStream(df1)
val query2 = eventTwoStream(df2)
spark.streams.awaitAnyTermination()

def eventOneStream(dataframe: DataFrame): StreamingQuery = {
  // do some transformations
  // then write the stream to the console
  dataframe
    .writeStream
    .outputMode("append")
    .format("console")
    .start()
}

def eventTwoStream(dataframe: DataFrame): StreamingQuery = {
  // do some other type of transformations
  // then write the stream to the console
  dataframe
    .writeStream
    .outputMode("append")
    .format("console")
    .start()
}
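As a side note, awaitAnyTermination() returns (or rethrows a failed query's exception) as soon as any one of the running queries stops. If you want to keep blocking while the remaining queries are still active, a minimal sketch of that pattern (error handling omitted) is:

// Wait until any query terminates; forget the terminated one and keep
// blocking as long as other queries are still running.
while (spark.streams.active.nonEmpty) {
  spark.streams.awaitAnyTermination()
  spark.streams.resetTerminated()
}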

Related

Spark Structured Streaming write operations into Couchbase?

I have tried to write data into Couchbase using Structured Streaming. I referred to the Spark connector 2.0 and declared the Couchbase connection in the Spark session, but I am not able to connect to Couchbase. Below is the example I tried. Any help will be appreciated, thanks in advance.
SparkSession spark = SparkSession
    .builder()
    .appName("couch")
    .master("local[*]")
    .config("spark.couchbase.nodes", "127.0.0.1")
    .config("spark.couchbase.bucket.mybucket", "123")
    .config("com.couchbase.username", "Administrator")
    .config("com.couchbase.password", "password")
    .getOrCreate();

Dataset<Row> df = ....; // dataset reading data from kafka

StreamingQuery query = df.writeStream()
    .outputMode("complete")
    .option("checkpointLocation", "mycheckpointlocation")
    .option("idField", "value")
    .format("com.couchbase.spark.sql")
    .start();

query.awaitTermination();

spark-structured streaming not displaying any data with format("memory")

When I do the following, it works fine:
company_info_df.select(col("value"))
    .writeStream()
    .outputMode("append")
    .option("truncate", false)
    .format("console")
    .trigger(Trigger.ProcessingTime("4 seconds"))
    .start();
But when I do it as below, i.e. with .format("memory"), it does not show anything:
company_info_df.select(col("value"))
    .writeStream()
    .outputMode("append")
    .option("truncate", false)
    .format("memory")
    .queryName("company_info")
    .option("checkpointLocation", checkpointDir + "\\console")
    .trigger(Trigger.ProcessingTime("4 seconds"))
    .start();

Dataset<Row> company_inf = sparkSession.sql("select * from company_info");
company_inf.show();
What am I doing wrong here? What is the correct way to do this?
Refer to the code below, which works in spark-shell for sample data:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
import spark.implicits._

val userSchema = new StructType().add("col1", "string").add("col2", "string").add("col3", "string").add("col4", "string").add("col5", "string").add("col6", "integer")

// read the stream from CSV source files landing in a folder
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/Temp")
csvDF.createOrReplaceTempView("abcd")

val dbDf2 = spark.sql("select col2, sum(col6) from abcd group by col2")
dbDf2.writeStream.queryName("abcdquery").outputMode("complete").format("memory").start()
In your code, try removing some of the options during the write operation and see what is going wrong.
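To actually see the data from a memory sink, the in-memory table registered under queryName is queried after the stream has processed at least one batch. A minimal sketch reusing the abcdquery example above (processAllAvailable() is a blocking helper intended mainly for testing):

// Start the query against the in-memory sink, wait until all currently
// available input has been processed, then query the table by its queryName.
val query = dbDf2.writeStream.queryName("abcdquery").outputMode("complete").format("memory").start()
query.processAllAvailable() // blocks until pending source data has been processed
spark.sql("select * from abcdquery").show()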

Spark Structured Streaming and DStream application is writing duplicates

We are trying to write a Spark streaming application that writes to HDFS. However, whenever we write the files, a lot of duplicates show up. This behavior occurs whether or not we crash the application with a kill, and for both the DStream and Structured APIs. The source is a Kafka topic. The behavior of the checkpoint directory seems very random, and I have not come across much relevant information on the issue.
Question: Can the checkpoint directory provide exactly-once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version : 2.11-1.1.0
zookeeper version : 3.4.8-1
HDP : 3.1
Any help is appreciated.
Thanks,
Gautam
import org.apache.spark.sql.SparkSession

object sparkStructuredDownloading {
  val kafka_brokers = "kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"

  def main(args: Array[String]): Unit = {
    val topic = args(0).trim()
    new downloadingAnalysis(kafka_brokers, topic).process()
  }
}

class downloadingAnalysis(brokers: String, topic: String) {

  def process(): Unit = {
    val spark = SparkSession.builder()
      .appName("sparkStructuredDownloading")
      .getOrCreate()
    spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    println("Application Started")

    import spark.implicits._
    import org.apache.spark.sql.streaming.Trigger

    val inputDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      //.option("kafka.group.id", "testduplicate")
      .load()

    // convert the binary Kafka value to text
    val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)")
    println("READ STREAM INITIATED")

    val filteredDF = personJsonDf.filter(line => new ParseLogs().validateLogLine(line.get(0).toString()))

    spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
      // UDF body was not included in the posted snippet; pass-through placeholder
      logLine
    })

    val df1 = filteredDF.selectExpr("parseLogLine(value) as result")
    println(df1.schema)
    println("WRITE STREAM INITIATED")

    val checkpoint_loc = "/warehouse/test_duplicate/download/checkpoint1"
    // the posted snippet writes from `result`, which is not defined; df1 is the parsed stream
    val kafkaOutput = df1.writeStream
      .outputMode("append")
      .format("orc")
      .option("path", "/warehouse/test_duplicate/download/data1")
      .option("maxRecordsPerFile", 10)
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}

Why does consuming from Kafka not finish in Cloudera but finishes in Hortonworks?

I have this code:
import org.apache.spark.sql.SparkSession

object TopicIngester {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // remove this later
      .appName("Ingester")
      .getOrCreate()

    spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "sandbox-hdp.hortonworks.com:6667" /* my.cluster.com:9092 in case of cloudera */)
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .write
      .parquet("/user/maria_dev/test")

    spark.stop()
  }
}
When I run it in the Hortonworks sandbox everything works fine: all available data is read from the test topic and saved into the /user/maria_dev/test folder.
I also have a topic with the same name on my Cloudera cluster, but for some reason the job gets stuck at .parquet("/path/to/folder") and never finishes, as if it were waiting for more data forever.
What could be the problem?

Resource not found exception using Kinesis with Spark streaming

I am trying to use Spark Structured Streaming with Kinesis. The following is my code:
val kinesis = spark
  .readStream
  .format("kinesis")
  .option("streams", streamName)
  .option("region", "us-east-1")
  .option("initialPosition", "TRIM_HORIZON")
  .option("endpointUrl", "kinesis.us-east-1.amazonaws.com")
  .option("awsAccessKey", accessKey)
  .option("awsSecretKey", secretKey)
  .option("format", "json")
  .option("inferSchema", "true")
  .schema(schema)
  .load

val query = kinesis.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()
I also exported the access key and secret key as environment variables, but when I execute it I get the following exception:
ERROR StreamExecution: Query [id = 45b3e45a-6a3b-48c6-9183-8531f85ea537, runId = ab5bbfd6-d9f7-455c-8065-fa65de932914] terminated with error
com.amazonaws.services.kinesis.model.ResourceNotFoundException: Stream dev-rt-stream-ds-5 under account 438156723281 not found. (Service: AmazonKinesis; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: f1b1384c-2528-e50f-a358-2e59f9ad3b45)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
The following code returns the correct number of shards:
val credentials = new BasicAWSCredentials(accessKey, secretKey)
val kinesisClient = new AmazonKinesisClient(credentials)
kinesisClient.setEndpoint("kinesis.us-east-1.amazonaws.com")
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
println("num shards => " + numShards)
