HashMap load behaving inconsistently in Spark Streaming code in client/cluster mode - apache-spark

I am creating a simple Kafka Spark Structured Streaming program where I read the data from a topic, process it using a UDF, and print the key and processed value to the console.
The processing in the UDF works like this:
1. We get a PtyID and a Date as input.
2. If we get a new PtyID, we return the value I+Date (Insert with date) from the UDF and store it in the HashMap for further comparison.
3. If we get an existing PtyID, we compare the date of the current record with the date of the record present in the HashMap.
3a) If the date of the current record is earlier than the date available in the HashMap, we return the value "D" (Discard) from the UDF.
3b) If the date of the current record is later than the date available in the HashMap, we return the value U+Date (Update) from the UDF and store it in the HashMap for further comparison.
Issue:
When I run the application in local mode, there is no issue and the I, U, D for each PtyID behave as expected.
When I run the same application in client mode and post the same data to the input topic multiple times, it behaves inconsistently: I+Date is generated multiple times for a PtyID that has already been consumed. The expectation is that if we post the same data, it should give I (Insert) the first time and U/D as per the comparison afterwards, but I get I+Date multiple times.
Below is the code snippet. Any help would be appreciated.
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object teststr {

  // map of PtyID -> latest date seen, used by the UDF below
  var hashMap = new ConcurrentHashMap[String, String]()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("test").getOrCreate()
    import spark.implicits._

    spark.readStream.format("kafka")
      .option(...) // all Kafka properties (elided)
      .load()
      .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition", "timestamp")
      .withColumn("value", processRecordsUDF(col("key"), col("value")))
      .select("key", "value")
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }

  def processRecords(partyId: String, value: String): String = {
    if (hashMap.containsKey(partyId)) {
      if (value > hashMap.get(partyId)) {
        // newer date: update the stored value and mark as Update
        hashMap.put(partyId, value)
        "U" + value
      } else {
        // older date: discard
        "D"
      }
    } else {
      // first time this PtyID is seen: store it and mark as Insert
      hashMap.put(partyId, value)
      "I" + value
    }
  }

  def processRecordsUDF = udf(processRecords _)
}

Related

Retrieve a String type column from a Spark Dataset as String variable, to pass that as a 'key' for the Redis cache

I am trying to use Spark Streaming to read the data from a Kafka topic.
The message from Kafka is JSON, which I am storing in the value column of the dataset as a String.
Sample message (just a sample, the actual JSON is complex):
{
"Name": "Bauddhik",
"Profession": "Developer"
}
Dataset<Row> df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic1")
.load()
.selectExpr("CAST(value AS STRING)");
Now, as my Dataset has a value column with the entire JSON, I need to pick one of the fields which I can use as a key while storing in Redis. Suppose the field is "Name" from the JSON.
So, first I did the select below to take out the "name" field as a new column in my DataFrame.
Dataset<Row> df1 = df.select(functions.col("value"), functions.get_json_object(functions.col("value"), "$['name']").as("name"));
This works fine and now my df1 looks like
Value | name
<Json> | Bauddhik
Now I want this to be inserted into the Redis cache with the key as 'Bauddhik' and the value as the entire JSON. So I am using the foreachBatch option below to persist to Redis.
df1.writeStream().foreachBatch(
    new VoidFunction2<Dataset<Row>, Long>() {
        public void call(Dataset<Row> dataset, Long batchId) {
            dataset.write()
                .format("org.apache.spark.sql.redis")
                .option("key.column", <hereistheissue>)
                .option("table", "test")
                .mode(SaveMode.Overwrite)
                .save();
        }
    }).start();
If you look at the above code (hereistheissue), I need to pass the key as Bauddhik, which I derived earlier as a separate column in the DataFrame.
I am not able to retrieve the name column as a String so that I can pass it to the Redis cache as the key. I have tried using map and df.head().getString(1), but nothing seems to be working.
Can anyone please guide me on how I can read a column from a dataset as a String and pass it to the key option while writing to the Redis cache?
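One direction that might help, sketched in Scala rather than Java: inside foreachBatch the batch is a plain (non-streaming) DataFrame, so a single value can be collected on the driver, or the derived name column can be handed to the connector as the key column. This is a minimal sketch, assuming batches are small enough to collect from, that the JSON field is "Name", and that the spark-redis connector accepts a "key.column" option naming the column to key rows by.
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.get_json_object

df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // derive the key column inside the batch (a plain, non-streaming DataFrame)
    val withName = batch.withColumn("name", get_json_object(batch("value"), "$['Name']"))

    // if a single String really is needed (e.g. for logging), it can be collected:
    // val firstKey: String = withName.select("name").head().getString(0)

    withName.write
      .format("org.apache.spark.sql.redis")
      .option("key.column", "name") // pass the column name; each row is keyed by its own value
      .option("table", "test")
      .mode(SaveMode.Overwrite)
      .save()
  }
  .start()
Pointing "key.column" at the column avoids having to extract one literal String per row at all.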

Spark access data frame from outside foreach batch (Structured Streaming)

I want to create and update a data frame inside the foreachBatch of a Spark stream and access it outside the foreachBatch iterator. Below is what I am trying to do in Spark Structured Streaming.
Is it possible to access data frames which are created or updated inside foreachBatch from outside foreachBatch in Spark Structured Streaming?
// assign an empty data frame
var df1: Option[DataFrame] = None

validatedFinalDf.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    println("I am here printing batchDF")
    batchDF.withColumn("extra", lit("batch-df")).show()

    // unpersist the previous data frame if it holds data
    if (df1.isDefined) {
      df1.get.unpersist()
    }

    // assign data to the outer data frame
    df1 = Some(batchDF.withColumn("extra", lit("batch-df-dim")))
  }.start()

// accessing the data frame outside foreachBatch does not work (stale data)
if (df1.isDefined) {
  df1.get.show()
}
spark.streams.awaitAnyTermination()
I can't even access temp tables which are created inside foreachBatch from outside foreachBatch.
Even the data frame which is updated inside foreachBatch shows stale data when accessed from outside foreachBatch.
Thanks
Sri
foreachBatch iterates over the collection and, if I'm not mistaken, expects an effectful operation (e.g. writes, print, etc.).
However, what you do inside the body is assign a temporary result to an external var.
So here are the problems:
Conceptually that is wrong because, even if it had worked fine, you would end up with just the last DataFrame assigned to your var.
I think you need to start the operation as exemplified in the doc here.
DataFrames are immutable. If you want to change your DataFrame, use mapping functions (e.g. withColumn) or other transformation APIs and return the new DataFrame.
When you're satisfied with the result, only then persist it using the foreach / foreachBatch calls, as sketched below.
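To make that concrete, a minimal sketch of the suggested pattern (assuming validatedFinalDf is the streaming DataFrame from the question; the parquet output path is a hypothetical sink): do the transformation on the streaming DataFrame itself and keep foreachBatch purely for the side-effecting write.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// transform first, on the streaming DataFrame itself
val enriched = validatedFinalDf.withColumn("extra", lit("batch-df-dim"))

// then use foreachBatch only for the effectful part (the write)
enriched.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode("append")
      .parquet("/tmp/enriched-output") // hypothetical sink path
  }
  .start()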
A small workaround did the trick: convert the batch data frame to an in-memory stream, which can then be accessed outside foreachBatch.
case class StreamData(
account_id: String,
run_dt: String,
trxn_dt: String,
trxn_amt: String)
import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

implicit val ctx = spark.sqlContext
val streamDataSource = MemoryStream[StreamData]
source.writeStream
.foreachBatch { (batchDf: DataFrame, batchId: Long) =>
val batchDs = batchDf.as[StreamData]
val obj = batchDs
.map(x => StreamData(x.account_id,x.run_dt,x.trxn_dt, x.trxn_amt))
.collect()
streamDataSource.addData(obj)
}
.start()
val datasetStreaming: Dataset[StreamData] = streamDataSource.toDS()
println("This is the streaming dataset:")
datasetStreaming
.writeStream
.format("console")
.outputMode("append")
.start()
spark.streams.awaitAnyTermination()

How to enrich data of a streaming query and write the result to Elasticsearch?

For a given dataset (originalData) I'm required to map the values and then prepare a new dataset combining the search results from Elasticsearch.
Dataset<Row> originalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
List<SGEvent> listSGEvent = new ArrayList<>();
originalData.foreach((ForeachFunction<Row>) row -> {
SGEvent event = new SGEvent();
String sourceKey=row.get(4).toString();
String searchQuery = "select id from es_correlation where es_correlation.key='"+sourceKey+"'";
Dataset<Row> result = spark.sqlContext().sql(searchQuery);
String id = null;
if (result != null) {
result.show();
id = result.first().toString();
}
event.setId(id);
event.setKey(sourceKey);
    listSGEvent.add(event);
});
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<Row> finalData = spark.createDataset(listSGEvent, eventEncoderSG).toDF();
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();
Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
Is my approach to combining the Elasticsearch values with the Dataset valid? Is there a better solution for this?
There are a couple of issues here.
As the exception says, originalData is a streaming query (streaming Dataset) and the only way to execute it is to use writeStream.start(). That's one issue.
You did writeStream.start(), but with another query, finalData, which is not streaming but batch. That's another issue.
For "enrichment" cases like yours, you can use a streaming join (the Dataset.join operator) or one of DataStreamWriter.foreach and DataStreamWriter.foreachBatch. I think DataStreamWriter.foreachBatch would be more efficient.
public DataStreamWriter<T> foreachBatch(VoidFunction2<Dataset<T>,Long> function)
(Java-specific) Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous). In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier. The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Not only would you get all the data of a streaming micro-batch in one shot (the first input argument of type Dataset<T>), but also a way to submit another Spark job (across executors) based on the data.
The pseudo-code could look as follows (I'm using Scala as I'm more comfortable with the language):
val dsWriter = originalData.writeStream.foreachBatch { (data, batchId) =>
// make sure the data is small enough to collect on the driver
// Otherwise expect OOME
// It'd also be nice to have a Java bean to convert the rows to proper types and names
val localData = data.collect
// Please note that localData is no longer Spark's Dataset
// It's a local Java collection
// Use Java Collection API to work with the localData
// e.g. using Scala
// You're mapping over localData (for a single micro-batch)
// And creating finalData
// I'm using the same names as your code to be as close to your initial idea as possible
val finalData = localData.map { row =>
// row is the old row from your original code
// do something with it
// e.g. using Java
String sourceKey=row.get(4).toString();
...
}
// Time to save the data processed to ES
// finalData is a local Java/Scala collection not Spark's DataFrame!
// Let's convert it to a DataFrame (and leverage the Spark distributed platform)
// Note that I'm almost using your code, but it's a batch query not a streaming one
// We're inside foreachBatch
finalData
.toDF // Convert a local collection to a Spark DataFrame
.write // this creates a batch query
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.save("spark_index/doc") // save (not start) as it's a batch query inside a streaming query
}
dsWriter is a DataStreamWriter, and you can now call start() on it to launch the streaming query.
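For completeness, starting it could look like this (a two-line sketch reusing the dsWriter name from the pseudo-code above):
val query = dsWriter.start() // kicks off the streaming query
query.awaitTermination()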
I was able to achieve the actual solution by using SQL joins.
Please refer to the code below.
Dataset<Row> orignalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
orignalData.createOrReplaceTempView("stream_data");
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
Dataset<Row> joinedData = spark.sqlContext().sql("select * from stream_data,es_correlation where es_correlation.key=stream_data.key");
// Or
/* By using Dataset Join Operator
Dataset<Row> joinedData = orignalData.join(esFirst, "key");
*/
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<SGEvent> finalData = joinedData.map((MapFunction<Row, SGEvent>) row -> {
SGEvent event = new SGEvent();
event.setId(row.get(0).toString());
event.setKey(row.get(3).toString());
return event;
},eventEncoderSG);
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();

Spark Structured Streaming - testing one batch at a time

I'm trying to create a test for a custom MicroBatchReadSupport DataSource which I've implemented.
For that, I want to invoke one batch at a time, which will read the data using this DataSource (I've created appropriate mocks). I want to invoke a batch, verify that the correct data was read (currently by saving it to a memory sink and checking the output), and only then invoke the next batch and verify its output.
I couldn't find a way to invoke the batches one after the other.
If I use streamingQuery.processAllAvailable(), the batches are invoked one after the other without allowing me to verify the output of each one separately. Using trigger(Trigger.Once()) doesn't help either, because it executes one batch and I can't continue to the next one.
Is there any way to do what I want?
Currently this is my basic code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
.format("memory")
.queryName("test_output")
val streamingQuery = dsw
.start()
streamingQuery.processAllAvailable()
What I've ended up doing is setting up the test with a DataStreamWriter which runs once, but saves the current status to a checkpoint. So each time we invoke dsw.start(), the new batch is resumed from the latest offset, according to the checkpoint. I'm also saving the data into a globalTempView, so I will be able to query the data in a similar way to using the memory sink. For doing that, I'm using foreachBatch (which is only available since Spark 2.4).
This is in code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw = getNewDataStreamWriter(dataFrame)
testFirstBatch(dsw)
testSecondBatch(dsw)
private def getNewDataStreamWriter(dataFrame: DataFrame) = {
val checkpointTempDir = Files.createTempDirectory("tests").toAbsolutePath.toString
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
.trigger(Trigger.Once())
.option("checkpointLocation", checkpointTempDir)
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.createOrReplaceGlobalTempView("input_data")
}
dsw
}
And the actual test code for each batch (e.g. testFirstBatch) is:
val rows = processNextBatch(dsw)
assertResult(10)(rows.length)
private def processNextBatch(dsw: DataStreamWriter[Row]) = {
val streamingQuery = dsw
.start()
streamingQuery.processAllAvailable()
sparkSession.sql("select * from global_temp.input_data").collect()
}

spark structured streaming dynamic string filter

We are trying to use dynamic filter for a structured streaming application.
Let's say we have following pseudo-implementation of a Spark structured streaming application:
spark.readStream()
.format("kafka")
.option(...)
...
.load()
.filter(getFilter()) // <-- dynamic stuff - def filter(conditionExpr: String)
.writeStream()
.format("kafka")
.option(.....)
.start();
and getFilter returns a String:
String getFilter() {
    // dynamic stuff to create the expression
    return expression; // e.g. "column = true";
}
Is it possible in the current version of Spark to have a dynamic filter condition? I mean the getFilter() method should dynamically return a filter condition (let's say it's refreshed every 10 minutes). We tried to look into broadcast variables but are not sure whether Structured Streaming supports such a thing.
It looks like it's not possible to update the job's configuration once it's submitted. We deploy on YARN.
Every suggestion/option is highly appreciated.
EDIT:
assume getFilter() returns:
(columnA = 1 AND columnB = true) OR customHiveUDF(columnC, 'input') != 'required' OR columnD > 8
After 10 minutes we can have a small change (dropping the first expression before the first OR) and potentially a new expression (columnA = 2), e.g.:
customHiveUDF(columnC, 'input') != 'required' OR columnD > 10 OR columnA = 2
The goal is to have multiple filters for one Spark application and not submit multiple jobs.
A broadcast variable should be OK here. You can write a typed filter like:
query.filter(x => x > bv.value).writeStream(...)
where bv is a broadcast variable. You can update it as described here: How can I update a broadcast variable in spark streaming?
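As a rough illustration of that suggestion, here is a minimal sketch, assuming the stream carries an integer encoded as a string and the filter is a simple numeric threshold; the topic name and bootstrap servers are placeholders, not taken from the question.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("broadcast-filter").getOrCreate()
import spark.implicits._

// broadcast the current threshold from the driver
val bv = spark.sparkContext.broadcast(10)

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "input-topic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .as[String]
  .filter(s => s.toInt > bv.value) // typed filter reading the broadcast value
  .writeStream
  .format("console")
  .start()
  .awaitTermination()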
Another solution is to provide e.g. an RPC or RESTful endpoint and ask this endpoint every 10 minutes. For example (Java, because it is simpler here):
class EndpointProxy {
    static Configuration lastValue;
    static long lastUpdated;
    static long refreshRate = 10 * 60 * 1000; // e.g. refresh every 10 minutes

    public static Configuration getConfiguration() {
        // refresh the cached value once the refresh interval has elapsed
        if (lastUpdated + refreshRate < System.currentTimeMillis()) {
            lastUpdated = System.currentTimeMillis();
            lastValue = askMyAPI();
        }
        return lastValue;
    }
}
query.filter (x => x > EndpointProxy.getConfiguration().getX()).writeStream()
Edit: hacky workaround for the user's problem:
You can create a 1-row view:
// confsDF should live in some driver-side singleton
var confsDF = Seq(some content).toDF("someColumn")
and then use:
query.crossJoin(confsDF.as("conf")) // cross join as we have only 1 value
  .filter("hiveUDF(conf.someColumn)")
  .writeStream()...

new Thread() {
  override def run(): Unit = {
    // periodically rebuild the driver-side DataFrame
    confsDF = Seq(some new data).toDF("someColumn")
  }
}.start()
This hack relies on Spark's default execution model, micro-batches. In each trigger the query is rebuilt, so the new data should be taken into consideration.
You can also do this in the thread:
Seq(some new data).toDF("someColumn").createOrReplaceTempView("conf")
and then in the query:
.crossJoin(spark.table("conf"))
Both should work. Keep in mind that it won't work with Continuous Processing mode.
Here is a simple example in which I am dynamically filtering records coming from a socket. Instead of the date you can use any REST API that updates your filter dynamically, or a lightweight ZooKeeper instance.
Note: if you plan to use a REST API, ZooKeeper, or any other external option, use mapPartitions instead of filter, because then you make the API call/connection once per partition (see the mapPartitions sketch after the example below).
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
// Keep only the lines that match the dynamic condition (here: the current minute)
val words = lines.as[String].filter(_ == new java.util.Date().getMinutes.toString)
// Generate running word count
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
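And, as flagged in the note above, if the condition comes from a REST API or ZooKeeper rather than the system clock, a mapPartitions variant keeps the lookups to one per partition. A hedged sketch, where fetchAllowedValues() is a hypothetical helper that calls your endpoint and returns the currently allowed values:
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// hypothetical helper: one remote call, returns the current filter set
def fetchAllowedValues(): Set[String] = ??? // e.g. call a REST endpoint / read ZooKeeper

val filtered = lines.as[String].mapPartitions { iter =>
  val allowed = fetchAllowedValues() // called once per partition, not once per record
  iter.filter(allowed.contains)
}

filtered.writeStream
  .outputMode("append")
  .format("console")
  .start()
  .awaitTermination()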

Resources