Spark dataframe lose streaming capability after accessing Kafka source - apache-spark

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark structured streaming with data coming from Kafka to Spark. In general it does work in the test mode but since I have to do some processing of the data (and do not know another way to do) the Spark data frames do not have the streaming capability anymore.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test")
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?

It is not recommend to use the memory sink for production applications as all the data will be stored in the driver.
There is also no reason to do this, except for debugging purposes, as you can process your streaming dataframes like the 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.

Related

How to store data from a dataframe in a variable to use as a parameter in a select in cassandra?

I have a Spark Structured Streaming application. The application receives data from kafka, and should use these values ​​as a parameter to process data from a cassandra database. My question is how do I use the data that is in the input dataframe (kafka), as "where" parameters in cassandra "select" without taking the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
This is my df input:
val df = spark
.readStream
.format("kafka")
.options(
Map("kafka.bootstrap.servers"-> kafka_bootstrap,
"subscribe" -> kafka_topic,
"startingOffsets"-> "latest",
"fetchOffset.numRetries"-> "5",
"kafka.group.id"-> groupId
))
.load()
I get this error whenever I try to store the dataframe values ​​in a variable to use as a parameter.
This is the method I created to try to convert the data into variables. With that the spark give the error that I mentioned earlier:
def processData(messageToProcess: DataFrame): DataFrame = {
val messageDS: Dataset[Message] = messageToProcess.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
val mensagem = messageToProcess
mensagem
}
When you need to use data in Kafka to query data in Cassandra, then such operation is a typical join between two datasets - you don't need to call .collect to find entries, you just do the join. And it's quite typical thing - to enrich data in Kafka with data from the external dataset, and Cassandra provides low-latency operations.
Your code could look as following (you'll need to configure so-called DirectJoin, see link below):
import spark.implicits._
import org.apache.spark.sql.cassandra._
val df = spark.readStream.format("kafka")
.options(Map(...)).load()
... decode data in Kafka into columns
val cassdata = spark.read.cassandraFormat("table", "keyspace").load
val joined = df.join(cassdata, cassdata("pk") === df("some_column"))
val processed = ... process joined data
val query = processed.writeStream.....output data somewhere...start()
query.awaitTermination()
I have detailed blog post on how to perform efficient joins with data in Cassandra.
As the error message suggest, you have to use writeStream.start() in order to execute a Structured Streaming query.
You can't use the same actions you use for batch dataframes (like .collect(), .show() or .count()) on streaming dataframes, see the Unsupported Operations section of the Spark Structured Streaming documentation.
In your case, you are trying to use messageDS.collect() on a streaming dataset, which is not allowed. To achieve this goal you can use a foreachBatch output sink to collect the rows you need at each microbatch:
streamingDF.writeStream.foreachBatch { (microBatchDf: DataFrame, batchId: Long) =>
// Now microBatchDf is no longer a streaming dataframe
// you can check with microBatchDf.isStreaming
val messageDS: Dataset[Message] = microBatchDf.as[Message]
val listData: Array[Message] = messageDS.collect()
listData.foreach(x => println(x.country))
// ...
}

How to refresh loaded dataframe contents in spark streaming?

Using spark-sql 2.4.1 and kafka for real time streaming.
I have following use case
Need to load a meta-data from hdfs for joining with streaming dataframe from kafka.
streaming data record's particular columns should be looked up in meta-data dataframe particular colums(col-X) data.
If found pick meta-data column(col-Y) data
Else not found , insert streaming record/column data into meta-data dataframe i.e. into hdfs. I.e. it should be looked up if
streaming dataframe contain same data again.
As meta-data loaded at the beginning of the spark job how to refresh its contents again in the streaming-job to lookup and join with another streaming dataframe ?
I may have misunderstood the question, but refreshing the metadata dataframe should be a feature supported out of the box.
You simply don't have to do anything.
Let's have a look at the example:
// a batch dataframe
val metadata = spark.read.text("metadata.txt")
scala> metadata.show
+-----+
|value|
+-----+
|hello|
+-----+
// a streaming dataframe
val stream = spark.readStream.text("so")
// join on the only value column
stream.join(metadata, "value").writeStream.format("console").start
As long as the content of the files in so directory matches metadata.txt file you should get a dataframe printed out to the console.
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
|hello|
+-----+
Change metadata.txt to, say, world and only worlds from new files get matched.
EDIT This solution is more elaborate and would work (for all use cases).
For simpler cases where the data is appended to existing files without changing the files or read from the databases simpler solution can be used as pointed out in the other answer.
This is because the dataframe (and underlying RDD) partitions are created once and the data is read everytime the datafframe is used. (unless it is cached by spark)
If can afford it you can try to (re)read this meta-data dataframe in every micro-bacth.
A better approach would be to put the meta-data dataframe in a cache (not to be confused with spark caching the dataframe). A cache is similar to a map except that it will not not give entries inserted more than the configured time-to-live duration.
In your code you'll try to fetch this meta-data dataframe from the cache once for every micro batch. If the cache return null. You'll read the data frame again, put into cache and then use the dataframe.
The Cache class would be
import scala.collection.mutable
// cache class to store the dataframe
class Cache[K, V](timeToLive: Long) extends mutable.Map[K, V] {
private var keyValueStore = mutable.HashMap[K, (V, Long)]()
override def get(key: K):Option[V] = {
keyValueStore.get(key) match {
case Some((value, insertedAt)) if insertedAt+timeToLive > System.currentTimeMillis => Some(value)
case _ => None
}
}
override def iterator: Iterator[(K, V)] = keyValueStore.iterator
.filter({
case (key, (value, insertedAt)) => insertedAt+timeToLive > System.currentTimeMillis
}).map(x => (x._1, x._2._1))
override def -=(key: K): this.type = {
keyValueStore-=key
this
}
override def +=(kv: (K, V)): this.type = {
keyValueStore += ((kv._1, (kv._2, System.currentTimeMillis())))
this
}
}
The logic to access the meta-data dataframe through the cache
import org.apache.spark.sql.DataFrame
object DataFrameCache {
lazy val cache = new Cache[String, DataFrame](600000) // ten minutes timeToLive
def readMetaData: DataFrame = ???
def getMetaData: DataFrame = {
cache.get("metadataDF") match {
case Some(df) => df
case None => {
val metadataDF = readMetaData
cache.put("metadataDF", metadataDF)
metadataDF
}
}
}
}
Below is the scenario which I followed in spark 2.4.5 for left outer join with stream join.Below process is pushing spark to read latest dimension data changes.
Process is for Stream Join with batch dimension (always update)
Step 1:-
Before starting Spark streaming job:-
Make sure dimension batch data folder has only one file and the file should have at-least one record (for some reason placing empty file is not working).
Step 2:-
Start your streaming job and add a stream record in kafka stream
Step 3:-
Overwrite dim data with values (the file should be same name don't change and the dimension folder should have only one file)
Note:- don't use spark to write to this folder use Java or Scala filesystem.io to overwrite the file or bash delete the file and replace with new data file with same name.
Step 4:-
In next batch spark is able to read updated dimension data while joining with kafka stream...
Sample Code:-
package com.broccoli.streaming.streamjoinupdate
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}
object BroadCastStreamJoin3 {
def main(args: Array[String]): Unit = {
#transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
Logger.getLogger("io.netty").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
val schemaUntyped1 = StructType(
Array(
StructField("id", StringType),
StructField("customrid", StringType),
StructField("customername", StringType),
StructField("countrycode", StringType),
StructField("timestamp_column_fin_1", TimestampType)
))
val schemaUntyped2 = StructType(
Array(
StructField("id", StringType),
StructField("countrycode", StringType),
StructField("countryname", StringType),
StructField("timestamp_column_fin_2", TimestampType)
))
val factDf1 = spark.readStream
.schema(schemaUntyped1)
.option("header", "true")
.csv("src/main/resources/broadcasttest/fact")
val dimDf3 = spark.read
.schema(schemaUntyped2)
.option("header", "true")
.csv("src/main/resources/broadcasttest/dimension")
.withColumnRenamed("id", "id_2")
.withColumnRenamed("countrycode", "countrycode_2")
import spark.implicits._
factDf1
.join(
dimDf3,
$"countrycode_2" <=> $"countrycode",
"inner"
)
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination
}
}
Thanks
Sri

Call a function with each element a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found specific purpose methods, like writing it to the console or dumping into memory, but I want to add some business logic, and put some results into Redis.
To be more specific, this is how it would look like in non-stream case:
val someDataFrame = Seq(
("key1", "value1"),
("key2", "value2"),
("key3", "value3"),
("key4", "value4")
).toDF()
def someFunction(keyValuePair: (String, String)) = {
println(keyValuePair)
}
someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if the someDataFrame is not a simple data frame but a stream data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solving this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I rather not provide them.
I do not want to solve the specific problem of displaying the result. I.e. please do not provide a solution to this specific case. The someFunction would look like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
doSomething(keyValuePair);
}
(Question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example, therefore does not answer my question.)
Here is an example of reading using foreachBatch to save every item to redis using the streaming api.
Related to a previous question (DataFrame to RDD[(String, String)] conversion)
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._
// schema of csv files
val userSchema = new StructType()
.add("name", "string")
.add("age", "string")
// create a data stream reader from a dir with csv files
val csvDF = spark
.readStream
.format("csv")
.option("sep", ";")
.schema(userSchema)
.load("./data") // directory where the CSV files are
// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default
csvDF.map(r => (r.getString(0), r.getString(0))) // converts the dataset to a Dataset[(String, String)]
.writeStream // create a data stream writer
.foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to a RDD
.start // start processing
Call simple user defined function foreachbatch in spark streaming.
please try this,
it will print 'hello world' for every message from tcp socket
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession .builder .appName("StructuredNetworkWordCount") .getOrCreate()
# Create DataFrame representing the stream of input lines from connection tolocalhost:9999
lines = spark .readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
def process_row(df, epoch_id):
# # Write row to storage
print('hello world')
query = words.writeStream.foreachBatch(process_row).start()
#query = wordCounts .writeStream .outputMode("complete") .format("console") .start()
query.awaitTermination()

How to transform structured streams with PySpark?

This seems like it should be obvious, but in reviewing the docs and examples, I'm not sure I can find a way to take a structured stream and transform using PySpark.
For example:
from pyspark.sql import SparkSession
spark = (
SparkSession
.builder
.appName('StreamingWordCount')
.getOrCreate()
)
raw_records = (
spark
.readStream
.format('socket')
.option('host', 'localhost')
.option('port', 9999)
.load()
)
# I realize there's a SQL function for upper-case, just illustrating a sample
# use of an arbitrary map function
records = raw_records.rdd.map(lambda w: w.upper()).toDF()
counts = (
records
.groupBy(records.value)
.count()
)
query = (
counts
.writeStream
.outputMode('complete')
.format('console')
.start()
)
query.awaitTermination()
This will throw the following exception:
Queries with streaming sources must be executed with writeStream.start
However, if I remove the call to rdd.map(...).toDF() things seem to work fine.
Seems as though the call to rdd.map branched execution from the streaming context and causes Spark to warn that it was never started?
Is there a "right" way to apply map or mapPartition style transformations using Structured Streaming and PySpark?
Every transformation that is applied in Structured Streaming has to be fully contained in Dataset world - in case of PySpark it means you can use only DataFrame or SQL and conversion to RDD (or DStream or local collections) are not supported.
If you want to use plain Python code you have to use UserDefinedFunction.
from pyspark.sql.functions import udf
#udf
def to_upper(s)
return s.upper()
raw_records.select(to_upper("value"))
See also Spark Structured Streaming and Spark-Ml Regression
Another way for a specific column (column_name):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def to_uper(string):
return string.upper()
to_upper_udf = udf(to_upper,StringType())
records = raw_records.withColumn("new_column_name"
,to_upper_udf(raw_records['column_name']))\
.drop("column_name")

How to print Json encoded messages using Spark Structured Streaming

I have a DataSet[Row] where each row is JSON string. I want to just print the JSON stream or count the JSON stream per batch.
Here is my code so far
val ds = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers",bootstrapServers"))
.option("subscribe", topicName)
.option("checkpointLocation", hdfsCheckPointDir)
.load();
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload)
val ds2 = ds1.select($"payload.info")
val query = ds2.writeStream.outputMode("append").queryName("table").format("memory").start()
query.awaitTermination()
select * from table; -- don't see anything and there are no errors. However when I run my Kafka consumer separately (independent ofSpark I can see the data)
My question really is what do I need to do just print the data I am receiving from Kafka using Structured Streaming? The messages in Kafka are JSON encoded strings so I am converting JSON encoded strings to some struct and eventually to a dataset. I am using Spark 2.1.0
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as payload).select($"payload.*")
That will print your data on the console.
ds1.writeStream.format("console").option("truncate", "false").start().awaitTermination()
Always use something like awaitTermination() or thread.Sleep(time in seconds) in these type of situations.

Resources