How to refresh loaded dataframe contents in spark streaming? - apache-spark

I am using spark-sql 2.4.1 and Kafka for real-time streaming.
I have the following use case:
I need to load meta-data from HDFS to join with a streaming dataframe coming from Kafka.
Particular columns of each streaming record should be looked up against a particular column (col-X) of the meta-data dataframe.
If a match is found, pick the meta-data column (col-Y) value.
If no match is found, insert the streaming record's column data into the meta-data dataframe, i.e. into HDFS, so that it can be matched when the streaming dataframe contains the same data again.
Since the meta-data is loaded once at the beginning of the Spark job, how do I refresh its contents in the streaming job so it can be looked up and joined with the streaming dataframe?

I may have misunderstood the question, but refreshing the metadata dataframe should be a feature supported out of the box.
You simply don't have to do anything.
Let's have a look at the example:
// a batch dataframe
val metadata = spark.read.text("metadata.txt")
scala> metadata.show
+-----+
|value|
+-----+
|hello|
+-----+
// a streaming dataframe
val stream = spark.readStream.text("so")
// join on the only value column
stream.join(metadata, "value").writeStream.format("console").start
As long as the content of the files in the so directory matches the metadata.txt file, you should get a dataframe printed out to the console.
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
|hello|
+-----+
Change metadata.txt to, say, world, and only world values from new files will get matched.

EDIT This solution is more elaborate and works for all use cases.
For simpler cases, where the data is appended to existing files without changing them or is read from a database, the simpler solution pointed out in the other answer can be used.
This is because the dataframe (and the underlying RDD) partitions are created once, and the data is read every time the dataframe is used (unless it is cached by Spark).
If you can afford it, you can simply (re)read this meta-data dataframe in every micro-batch.
A better approach is to put the meta-data dataframe in a cache (not to be confused with Spark caching the dataframe). A cache is similar to a map, except that it will not return entries that were inserted more than the configured time-to-live (TTL) duration ago.
In your code, you try to fetch this meta-data dataframe from the cache once per micro-batch. If the cache returns None, you read the dataframe again, put it into the cache and then use it.
The Cache class would be
import scala.collection.mutable

// cache class to store the dataframe
class Cache[K, V](timeToLive: Long) extends mutable.Map[K, V] {
  private val keyValueStore = mutable.HashMap[K, (V, Long)]()

  override def get(key: K): Option[V] = {
    keyValueStore.get(key) match {
      case Some((value, insertedAt)) if insertedAt + timeToLive > System.currentTimeMillis => Some(value)
      case _ => None
    }
  }

  override def iterator: Iterator[(K, V)] = keyValueStore.iterator
    .filter {
      case (key, (value, insertedAt)) => insertedAt + timeToLive > System.currentTimeMillis
    }
    .map(x => (x._1, x._2._1))

  override def -=(key: K): this.type = {
    keyValueStore -= key
    this
  }

  override def +=(kv: (K, V)): this.type = {
    keyValueStore += ((kv._1, (kv._2, System.currentTimeMillis())))
    this
  }
}
The logic to access the meta-data dataframe through the cache
import org.apache.spark.sql.DataFrame

object DataFrameCache {
  lazy val cache = new Cache[String, DataFrame](600000) // ten minutes timeToLive

  def readMetaData: DataFrame = ???

  def getMetaData: DataFrame = {
    cache.get("metadataDF") match {
      case Some(df) => df
      case None =>
        val metadataDF = readMetaData
        cache.put("metadataDF", metadataDF)
        metadataDF
    }
  }
}
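For illustration, here is a minimal usage sketch (my own, not part of the original answer) of calling DataFrameCache from foreachBatch; the streaming source, join column and sink are placeholders only:
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // hypothetical streaming source; replace with the real Kafka reader
    val streamingDF = spark.readStream.text("so")

    streamingDF.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // getMetaData re-reads from HDFS only after the 10-minute TTL has expired
        val metadataDF = DataFrameCache.getMetaData
        batchDF.join(metadataDF, Seq("value"), "left_outer").show()
      }
      .start()
      .awaitTermination()
  }
}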

Below is the scenario I followed in Spark 2.4.5 for a stream join with a batch dimension that is always updated. The process below pushes Spark to read the latest dimension data changes.
Step 1:
Before starting the Spark streaming job, make sure the dimension batch data folder has only one file, and that the file has at least one record (for some reason placing an empty file does not work).
Step 2:
Start your streaming job and add a record to the Kafka stream.
Step 3:
Overwrite the dimension data with new values (keep the same file name; don't change it, and the dimension folder should still contain only one file).
Note: don't use Spark to write to this folder; use Java or Scala file I/O to overwrite the file, or use bash to delete the file and replace it with a new data file of the same name (see the sketch after these steps).
Step 4:
In the next batch, Spark is able to read the updated dimension data while joining with the Kafka stream.
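If it helps, here is a minimal sketch (my own, not part of the original steps) of step 3's file overwrite using plain java.nio instead of Spark; the path, file name and contents are placeholders:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardCopyOption}

object OverwriteDimensionFile {
  def main(args: Array[String]): Unit = {
    // new dimension contents; in a real job these would come from wherever
    // the updated country data is produced
    val newContent =
      """id,countrycode,countryname,timestamp_column_fin_2
        |3,IN,India,2021-01-01 00:00:00
        |""".stripMargin

    // write to a hidden temp file in the same folder (Spark's file source
    // ignores files starting with "." or "_"), then move it over the old file
    val dir    = Paths.get("src/main/resources/broadcasttest/dimension")
    val target = dir.resolve("dim.csv")
    val tmp    = dir.resolve(".dim.csv.tmp")
    Files.write(tmp, newContent.getBytes(StandardCharsets.UTF_8))
    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
  }
}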
Sample Code:-
package com.broccoli.streaming.streamjoinupdate

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastStreamJoin3 {

  def main(args: Array[String]): Unit = {

    @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
    Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
    Logger.getLogger("io.netty").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    // schema of the streaming fact data
    val schemaUntyped1 = StructType(
      Array(
        StructField("id", StringType),
        StructField("customrid", StringType),
        StructField("customername", StringType),
        StructField("countrycode", StringType),
        StructField("timestamp_column_fin_1", TimestampType)
      ))

    // schema of the batch dimension data
    val schemaUntyped2 = StructType(
      Array(
        StructField("id", StringType),
        StructField("countrycode", StringType),
        StructField("countryname", StringType),
        StructField("timestamp_column_fin_2", TimestampType)
      ))

    val factDf1 = spark.readStream
      .schema(schemaUntyped1)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/fact")

    val dimDf3 = spark.read
      .schema(schemaUntyped2)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/dimension")
      .withColumnRenamed("id", "id_2")
      .withColumnRenamed("countrycode", "countrycode_2")

    import spark.implicits._

    factDf1
      .join(
        dimDf3,
        $"countrycode_2" <=> $"countrycode",
        "inner"
      )
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination
  }
}
Thanks
Sri

Related

Spark dataframe lose streaming capability after accessing Kafka source

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark structured streaming with data coming from Kafka into Spark. In general it works in test mode, but since I have to do some processing of the data (and do not know another way to do it), the Spark dataframes no longer have the streaming capability.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?
It is not recommended to use the memory sink for production applications, as all the data will be stored in the driver.
There is also no reason to do this, except for debugging purposes, as you can process your streaming dataframes like the 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.

Call a function with each element a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found specific purpose methods, like writing it to the console or dumping into memory, but I want to add some business logic, and put some results into Redis.
To be more specific, this is how it would look in the non-stream case:
val someDataFrame = Seq(
("key1", "value1"),
("key2", "value2"),
("key3", "value3"),
("key4", "value4")
).toDF()
def someFunction(keyValuePair: (String, String)) = {
println(keyValuePair)
}
someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if the someDataFrame is not a simple data frame but a stream data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solving this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I rather not provide them.
I do not want to solve the specific problem of displaying the result. I.e. please do not provide a solution to this specific case. The someFunction would look like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
doSomething(keyValuePair);
}
(Question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example, therefore does not answer my question.)
Here is an example that uses foreachBatch to save every item to Redis using the streaming API.
Related to a previous question (DataFrame to RDD[(String, String)] conversion)
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._
// schema of csv files
val userSchema = new StructType()
  .add("name", "string")
  .add("age", "string")

// create a data stream reader from a dir with csv files
val csvDF = spark
  .readStream
  .format("csv")
  .option("sep", ";")
  .schema(userSchema)
  .load("./data") // directory where the CSV files are

// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default

csvDF.map(r => (r.getString(0), r.getString(0))) // converts the dataset to a Dataset[(String, String)]
  .writeStream // create a data stream writer
  .foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to an RDD
  .start // start processing
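If you specifically want the per-row foreach / ForeachWriter route instead of foreachBatch, a hedged sketch could look like the following, reusing csvDF from above; the Jedis client is just an illustrative assumption, use whatever Redis client you prefer:
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

val redisWriter = new ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  // called once per partition and epoch: open the connection here
  override def open(partitionId: Long, epochId: Long): Boolean = {
    jedis = new Jedis("localhost", 6379)
    true
  }

  // called for every row of the micro-batch: put your business logic here
  override def process(row: Row): Unit =
    jedis.set(row.getString(0), row.getString(1))

  // called at the end of the partition: release the connection
  override def close(errorOrNull: Throwable): Unit =
    if (jedis != null) jedis.close()
}

csvDF.writeStream
  .foreach(redisWriter)
  .start()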
To call a simple user-defined function per batch in Spark streaming, please try this;
it will print 'hello world' for every message from the TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
def process_row(df, epoch_id):
    # Write row to storage
    print('hello world')

query = words.writeStream.foreachBatch(process_row).start()
# query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:
case class Person(name: String, age: Long)

val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream = ssc.textFileStream("file:///home/sa/testFiles")

smDstream.foreachRDD((rdd, time) => {
  val peopleDF = rdd.map(_.split(","))
    .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
    .toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql(
    "insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show
})

ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
please help
The question does not really make sense anymore, in that DStreams are being deprecated / abandoned.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to ponder as well, as I am not a serialization expert.
You can find a few posts of people trying to write to a Hive table directly as opposed to a path; in my answer I use a different approach, but you can use your approach of Spark SQL writing via a TempView, that is all possible.
I simulated the input with a QueueStream, so no split needs to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your tempView and then use spark.sql as per your initial approach.
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
It states saving to files, but foreachRDD can do what you want, albeit I assumed the idea was to write to external systems. Saving to files is quicker in my view, as opposed to going through the steps to write to a table directly. With streaming you want to offload data as soon as possible, as volumes are typically high.
Two steps:
In a separate class to the Streaming Class - run under Spark 2.4:
case class Person(name: String, age: Int)
Then the streaming logic you need to apply - you may need some imports that I have in my notebook, as I ran this under Databricks:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()

import spark.implicits._ // needed for toDF() below

val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)

QS.foreachRDD(q => {
  if (!q.isEmpty) {
    val q_flatMap = q.flatMap { x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.write
      .format("parquet")
      .mode(SaveMode.Append)
      .saveAsTable("SO_Quest_BigD")
  }
})

ssc.start()

for (c <- List(List(("Fred", 53), ("John", 22), ("Mary", 76)),
               List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
               List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}

ssc.awaitTermination()

Refresh Dataframe in Spark real-time Streaming without stopping process

In my application I get a stream of accounts from a Kafka queue (using Spark streaming with Kafka).
I need to fetch attributes related to these accounts from S3, so I'm planning to cache the resulting S3 dataframe, as the S3 data will not be updated for at least a day for now (it might change to 1 hour or 10 minutes very soon in the future). So the question is: how can I refresh the cached dataframe periodically without stopping the process?
Update: I'm planning to publish an event to Kafka whenever there is an update in S3 (using SNS and AWS Lambda), and my streaming application will subscribe to that event and refresh the cached dataframe based on it (basically unpersist() the cache and reload from S3).
Is this a good approach?
This question was recently asked on the Spark Mailing List
As far as I know the only way to do what you're asking is to reload the DataFrame from S3 when new data arrives which means you have to recreate the streaming DF as well and restart the query. This is because DataFrames are fundamentally immutable.
If you want to update (mutate) data in a DataFrame without reloading it, you need to try one of the datastores that integrate with or connect to Spark and allow mutations. One that I'm aware of is SnappyData.
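If it helps, a rough sketch of the reload-and-restart approach mentioned above could look like the following; the bucket path, topic, join column and sink are placeholders, not from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object ReloadAndRestart {
  def startQuery(spark: SparkSession): StreamingQuery = {
    // batch dataframe with the account attributes kept in S3
    val attributes = spark.read.parquet("s3a://some-bucket/account-attributes/")

    // streaming dataframe of accounts from Kafka
    val accounts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "accounts")
      .load()
      .selectExpr("CAST(value AS STRING) AS accountId")

    accounts.join(attributes, Seq("accountId"), "left_outer")
      .writeStream
      .format("console")
      .start()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    var query = startQuery(spark)
    // when the SNS/Lambda-driven event signals that S3 changed, recreate everything:
    // query.stop(); query = startQuery(spark)
    query.awaitTermination()
  }
}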
The simplest way to achieve this: the code below re-reads the dimension data folder for every batch, but do keep in mind that new dimension data values (country names in my case) have to arrive in a new file.
package com.databroccoli.streaming.dimensionupateinstreaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.{broadcast, expr}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

object RefreshDimensionInStreaming {

  def main(args: Array[String]) = {

    @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
    Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
    Logger.getLogger("io.netty").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    // schema of the streaming fact data
    val schemaUntyped1 = StructType(
      Array(
        StructField("id", StringType),
        StructField("customrid", StringType),
        StructField("customername", StringType),
        StructField("countrycode", StringType),
        StructField("timestamp_column_fin_1", TimestampType)
      ))

    // schema of the batch dimension data
    val schemaUntyped2 = StructType(
      Array(
        StructField("id", StringType),
        StructField("countrycode", StringType),
        StructField("countryname", StringType),
        StructField("timestamp_column_fin_2", TimestampType)
      ))

    val factDf1 = spark.readStream
      .schema(schemaUntyped1)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/fact")

    var countryDf: Option[DataFrame] = None: Option[DataFrame]

    // re-reads the dimension folder and swaps the cached dataframe
    def updateDimensionDf() = {
      val dimDf2 = spark.read
        .schema(schemaUntyped2)
        .option("header", "true")
        .csv("src/main/resources/broadcasttest/dimension")

      if (countryDf != None) {
        countryDf.get.unpersist()
      }

      countryDf = Some(
        dimDf2
          .withColumnRenamed("id", "id_2")
          .withColumnRenamed("countrycode", "countrycode_2"))

      countryDf.get.show()
    }

    factDf1.writeStream
      .outputMode("append")
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.show(10)

        updateDimensionDf()

        batchDF
          .join(
            countryDf.get,
            expr(
              """
              countrycode_2 = countrycode
              """
            ),
            "leftOuter"
          )
          .show
      }
      .start()
      .awaitTermination()
  }
}

RDD toDF() : Erroneous Behavior

I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call the foreachRDD method on the StreamingContext. The issue I'm facing is that there's data loss between when I call saveAsTextFile on the RDD and the DataFrame's write method with format("csv"). I can't seem to pinpoint why this is happening.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
stream.foreachRDD { rdd =>
  rdd.saveAsTextFile("/Users/jarvis/rdds/" + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_rdd")

  import spark.implicits._
  val messagesDF = rdd
    .map(_.split("\t"))
    .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
      w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0)))
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")

  messagesDF.write.format("csv").save(dumpPath + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_DF")
}
ssc.start()
ssc.awaitTermination()
There's data loss, i.e. many rows don't make it from the RDD to the DataFrame.
There's also replication: many rows that do reach the DataFrame are replicated many times.
Found the error. Actually, there was a misunderstanding of the ingested data format.
The intended data was "\t\t\t..." and hence each row was supposed to end at a "\n".
However, the actual data was:
"\t\t\t...\n\t\t\t...\n"
So the rdd.map(...) operation needed another split at every "\n" first, as sketched below.
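A minimal sketch of that fix (my own, assuming the same Record case class, autoTag helper and field layout as in the question's code):
stream.foreachRDD { rdd =>
  import spark.implicits._
  val messagesDF = rdd
    .flatMap(_.split("\n"))        // one element per logical record
    .map(_.split("\t"))
    .filter(_.length >= 7)         // defensively drop malformed lines
    .map(w => Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4),
      w(5).substring(w(5).lastIndexOf("http://")), w(6)))
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")

  messagesDF.write.format("csv").save(dumpPath + "_fixed_DF")
}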
