spark structured streaming dynamic string filter - apache-spark

We are trying to use a dynamic filter in a structured streaming application.
Let's say we have the following pseudo-implementation of a Spark Structured Streaming application:
spark.readStream()
.format("kafka")
.option(...)
...
.load()
.filter(getFilter()) <-- dynamic stuff - def filter(conditionExpr: String):
.writeStream()
.format("kafka")
.option(.....)
.start();
and getFilter() returns a String:
String getFilter() {
    // dynamic stuff to build the expression
    return expression; // e.g. "column = true";
}
Is it possible in the current version of Spark to have a dynamic filter condition? I mean the getFilter() method should dynamically return a filter condition (let's say it's refreshed every 10 minutes). We tried to look into broadcast variables but we are not sure whether Structured Streaming supports such a thing.
It looks like it's not possible to update a job's configuration once it has been submitted. We deploy with YARN.
Every suggestion/option is highly appreciated.
EDIT:
assume getFilter() returns:
(columnA = 1 AND columnB = true) OR customHiveUDF(columnC, 'input') != 'required' OR columnD > 8
after 10 minutes the condition can change slightly (e.g. without the first expression before the first OR) and potentially gain a new expression (columnA = 2), e.g.:
customHiveUDF(columnC, 'input') != 'required' OR columnD > 10 OR columnA = 2
The goal is to have multiple filters for one Spark application without submitting multiple jobs.

A broadcast variable should be OK here. You can write a typed filter like:
query.filter(x => x > bv.value).writeStream(...)
where bv is a Broadcast variable. You can update it as described here: How can I update a broadcast variable in spark streaming?
Another solution is to provide e.g. an RPC or RESTful endpoint and query that endpoint every 10 minutes. For example (Java, because it is simpler here):
class EndpointProxy {
    static Configuration lastValue;
    static long lastUpdated;
    static final long refreshRateMillis = 10 * 60 * 1000; // refresh every 10 minutes

    public static Configuration getConfiguration() {
        // refresh only when the cached value is older than the refresh rate
        if (System.currentTimeMillis() - lastUpdated > refreshRateMillis) {
            lastUpdated = System.currentTimeMillis();
            lastValue = askMyAPI();
        }
        return lastValue;
    }
}
query.filter (x => x > EndpointProxy.getConfiguration().getX()).writeStream()
Edit: hacky workaround for user's problem:
You can create 1-row view:
// confsDF should be in some driver-side singleton
var confsDF = Seq(some content).toDF("someColumn")
and then use:
query.crossJoin(confsDF.as("conf")) // cross join as we have only 1 value
.filter("hiveUDF(conf.someColumn)")
.writeStream()...
// driver-side refresher (pseudo-code): update confsDF periodically
new Thread() {
  override def run(): Unit = {
    confsDF = Seq(some new data).toDF("someColumn")
  }
}.start()
This hack relies on Spark's default execution model: micro-batches. On each trigger the query is rebuilt, so the new data is taken into consideration.
You can also do the following in the thread:
Seq(some new data).toDF("someColumn").createOrReplaceTempView("conf")
and then in query:
.crossJoin(spark.table("conf"))
Both should work. Keep in mind that neither will work with Continuous Processing mode.
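Putting the temp-view variant together, a minimal driver-side sketch could look as follows. Note this is only a sketch under assumptions: loadFilterConfig() is a hypothetical helper standing in for however you fetch the fresh value, hiveUDF is your own registered UDF, and the console sink is just a stand-in for whatever sink you use.
import java.util.concurrent.{Executors, TimeUnit}
import spark.implicits._

// initial 1-row view
Seq(loadFilterConfig()).toDF("someColumn").createOrReplaceTempView("conf")

// driver-side refresher that replaces the view every 10 minutes
val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit =
    Seq(loadFilterConfig()).toDF("someColumn").createOrReplaceTempView("conf")
}, 10, 10, TimeUnit.MINUTES)

// each micro-batch re-resolves spark.table("conf"), so it sees the latest row
query.crossJoin(spark.table("conf").as("conf"))
  .filter("hiveUDF(conf.someColumn)")
  .writeStream
  .format("console") // or the Kafka sink as in the question
  .start()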

Here is a simple example in which I dynamically filter records coming from a socket. Instead of the date you can use any REST API that updates your filter dynamically, or a lightweight ZooKeeper instance.
Note: if you are planning to use a REST API, ZooKeeper or any other option, use mapPartitions instead of filter, because in that case you only have to call the API / open the connection once per partition (see the mapPartitions sketch after the example below).
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
// Split the lines into words
val words = lines.as[String].filter(_ == new java.util.Date().getMinutes.toString)
// Generate running word count
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
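As mentioned in the note above, when the filter value comes from a remote service you would typically look it up once per partition rather than once per record. A rough sketch, where fetchFilterValue() is a hypothetical stand-in for your REST/ZooKeeper call:
val filtered = lines.as[String].mapPartitions { iter =>
  val currentFilter = fetchFilterValue() // one remote call per partition
  iter.filter(_ == currentFilter)
}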

Related

How to use spark to write to HBase using multi-thread

I'm using Spark to write data to HBase, but at the writing stage only one executor and one core are executing.
I wonder why my code is not writing properly, and what I should do to make it write faster?
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss:SparkSession,tableList: List[String], df:DataFrame): Unit ={
val tableName = tableList(0)
val rowKeyName = tableList(4)
val rowKeyType = tableList(5)
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
// Write to HBase
val sc = ss.sparkContext
sc.hadoopConfiguration.addResource(hbaseConf)
val columns = df.columns
val result = df.rdd.mapPartitions(par=>{
par.map(row=>{
var rowkey:String =""
if("String".equals(rowKeyType)){
rowkey = row.getAs[String](rowKeyName)
}else if("Long".equals(rowKeyType)){
rowkey = row.getAs[Long](rowKeyName).toString
}
val put = new Put(Bytes.toBytes(rowkey))
for(name<-columns){
var value = row.get(row.fieldIndex(name))
if(value!=null){
put.addColumn(Bytes.toBytes("cf"),Bytes.toBytes(name),Bytes.toBytes(value.toString))
}
}
(new ImmutableBytesWritable,put)
})
})
val job = Job.getInstance(sc.hadoopConfiguration)
job.setOutputKeyClass(classOf[ImmutableBytesWritable])
job.setOutputValueClass(classOf[Result])
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not be able to control how many parallel executors write to HBase.
However, you can start multiple Spark jobs from a multi-threaded client program.
e.g. you can have a shell script that triggers multiple spark-submit commands to induce parallelism. Each Spark job can work on one set of data, independent of the others, and push it into HBase.
This can also be done using the Spark Java/Scala SparkLauncher API together with the Java concurrency API (e.g. the Executor framework); a sketch of that combination follows the SparkLauncher example below.
val sparkLauncher = new SparkLauncher
// Set Spark properties. Only basic ones are shown here; they are overridden
// if the same properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")

// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()
// Get the application id
val jobAppId = sparkLauncher1.getAppId
// Get the status of the launched job. This loop will continually print
// statuses like RUNNING, SUBMITTED, etc.
while (true) {
  println(sparkLauncher1.getState().toString)
}
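The "Executor framework" combination mentioned above could be sketched roughly like this. The paths, class name, pool size and the dataSlices list are placeholders; each slice is handed to its own spark-submit through an application argument:
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.launcher.SparkLauncher

val pool = Executors.newFixedThreadPool(4)
val dataSlices = Seq("slice1", "slice2", "slice3")

dataSlices.foreach { slice =>
  pool.submit(new Runnable {
    override def run(): Unit = {
      val handle = new SparkLauncher()
        .setSparkHome("/path/to/SPARK_HOME")
        .setAppResource("/path/to/jar/to/be/executed")
        .setMainClass("MainClassName")
        .setMaster("yarn")
        .addAppArgs(slice) // tells the job which slice of data to write
        .startApplication()
      // block this worker thread until the launched application finishes
      while (!handle.getState.isFinal) Thread.sleep(5000)
    }
  })
}
pool.shutdown()
pool.awaitTermination(1, TimeUnit.HOURS)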
However, the challenge is to track each of them for failure and automatic recovery. This can be tricky, especially when partial data has already been written into HBase, i.e. a job fails before processing the complete set of data assigned to it. You may have to clean the partial data from HBase automatically before retriggering the job.

Hashmap load behaving inconsistently in spark streaming code client/cluster mode

I am creating a simple Kafka Spark streaming program where I read the data from a topic, process it using a UDF and print the key and processed value to the console.
Processing in the UDF works like this:
1. We get PtyID and Date as input.
2. If we get a new PtyID, we return I+Date (insert with date) from the UDF and store it in the hashmap for further comparison.
3. If we get an existing PtyID, we compare the date of the current record with the date of the record present in the hashmap.
3a) If the date of the current record is less than the date available in the hashmap, we return "D" (discard) from the UDF.
3b) If the date of the current record is greater than the date available in the hashmap, we return U+Date (update) from the UDF and store it in the hashmap for further comparison.
Issue:
When I run the application in local mode, there is no issue and the I, U, D for each PtyID behave as expected.
When I run the same application in client mode and post the same data to the input topic multiple times, it behaves inconsistently and I+Date is generated multiple times for the same PtyID that has already been consumed. The expectation is that posting the same data should give I (insert) the first time and U/D as per the comparison, but I get I+Date multiple times.
Below is the code snippet. Any help would be appreciated.
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object teststr {

  var hashMap = new ConcurrentHashMap[String, String]()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("test").getOrCreate()
    import spark.implicits._
    spark.readStream.format("kafka")
      .option ......(all properties)
      .load()
      .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition", "timestamp")
      .withColumn("value", processRecordsUDF(col("key"), col("value")))
      .select("key", "value")
      .writeStream
      .format("console")
      .outputMode("append")
      .start().awaitTermination()
  }

  def processRecords(partyId: String, value: String): String = {
    if (hashMap.containsKey(partyId)) {
      if (value > hashMap.get(partyId)) {
        hashMap.put(partyId, value)
        "U" + value
      } else {
        "D"
      }
    } else {
      hashMap.put(partyId, value)
      "I" + value
    }
  }

  def processRecordsUDF = udf(processRecords _)
}

How to enrich data of a streaming query and write the result to Elasticsearch?

For a given dataset (originalData) I'm required to map the values and then prepare a new dataset combining the search results from Elasticsearch.
Dataset<Row> originalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
List<SGEvent> listSGEvent = new ArrayList<>();
originalData.foreach((ForeachFunction<Row>) row -> {
SGEvent event = new SGEvent();
String sourceKey=row.get(4).toString();
String searchQuery = "select id from es_correlation where es_correlation.key='"+sourceKey+"'";
Dataset<Row> result = spark.sqlContext().sql(searchQuery);
String id = null;
if (result != null) {
result.show();
id = result.first().toString();
}
event.setId(id);
event.setKey(sourceKey);
listSGEvent.add(event);
});
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<Row> finalData = spark.createDataset(listSGEvent, eventEncoderSG).toDF();
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();
Spark throws the following exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:389)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
Is my approach of combining Elasticsearch values with the Dataset valid? Is there any better solution for this?
There are a couple of issues here.
As the exception says, originalData is a streaming query (streaming Dataset) and the only way to execute it is to use writeStream.start(). That's one issue.
You did call writeStream.start(), but on another query, finalData, which is not streaming but batch. That's another issue.
For "enrichment" cases like yours, you can use a streaming join (Dataset.join operator) or one of DataStreamWriter.foreach and DataStreamWriter.foreachBatch. I think DataStreamWriter.foreachBatch would be more efficient.
public DataStreamWriter<T> foreachBatch(VoidFunction2<Dataset<T>,Long> function)
(Java-specific) Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous). In every micro-batch, the provided function will be called with (i) the output rows as a Dataset and (ii) the batch identifier. The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Not only would you get all the data of a streaming micro-batch in one shot (the first input argument of type Dataset<T>), but also a way to submit another Spark job (across executors) based on the data.
The pseudo-code could look as follows (I'm using Scala as I'm more comfortable with the language):
val dsWriter = originalData.foreachBatch { case (data, batchId) =>
// make sure the data is small enough to collect on the driver
// Otherwise expect OOME
// It'd also be nice to have a Java bean to convert the rows to proper types and names
val localData = data.collect
// Please note that localData is no longer Spark's Dataset
// It's a local Java collection
// Use Java Collection API to work with the localData
// e.g. using Scala
// You're mapping over localData (for a single micro-batch)
// And creating finalData
// I'm using the same names as your code to be as close to your initial idea as possible
val finalData = localData.map { row =>
// row is the old row from your original code
// do something with it
// e.g. using Java
String sourceKey=row.get(4).toString();
...
}
// Time to save the data processed to ES
// finalData is a local Java/Scala collection not Spark's DataFrame!
// Let's convert it to a DataFrame (and leverage the Spark distributed platform)
// Note that I'm almost using your code, but it's a batch query not a streaming one
// We're inside foreachBatch
finalData
.toDF // Convert a local collection to a Spark DataFrame
.write // this creates a batch query
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.save("spark_index/doc") // save (not start) as it's a batch query inside a streaming query
}
dsWriter is a DataStreamWriter and you can now start it to start the streaming query.
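For completeness, starting it could look like the following sketch (names follow the pseudo-code above):
val streamingQuery = dsWriter.start()
streamingQuery.awaitTermination()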
I was able to achieve an actual solution by using SQL joins.
Please refer the code below.
Dataset<Row> originalData = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers","test")
.option("subscribe", "test")
.option("startingOffsets", "latest")
.load();
originalData.createOrReplaceTempView("stream_data");
Dataset<Row> esData = JavaEsSparkSQL
.esDF(spark.sqlContext(), "spark_correlation/doc");
esData.createOrReplaceTempView("es_correlation");
Dataset<Row> joinedData = spark.sqlContext().sql("select * from stream_data,es_correlation where es_correlation.key=stream_data.key");
// Or
/* By using Dataset Join Operator
Dataset<Row> joinedData = originalData.join(esFirst, "key");
*/
Encoder<SGEvent> eventEncoderSG = Encoders.bean(SGEvent.class);
Dataset<SGEvent> finalData = joinedData.map((MapFunction<Row, SGEvent>) row -> {
SGEvent event = new SGEvent();
event.setId(row.get(0).toString());
event.setKey(row.get(3).toString());
return event;
},eventEncoderSG);
finalData
.writeStream()
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "id")
.option("es.write.operation", "upsert")
.option("checkpointLocation","/tmp/checkpoint/sg_event")
.start("spark_index/doc").awaitTermination();

Spark Structured Streaming - testing one batch at a time

I'm trying to create a test for a custom MicroBatchReadSupport DataSource which I've implemented.
For that, I want to invoke one batch at a time, which will read the data using this DataSource (I've created appropriate mocks). I want to invoke a batch, verify that the correct data was read (currently by saving it to a memory sink and checking the output), and only then invoke the next batch and verify its output.
I couldn't find a way to invoke each batch after the other.
If I use streamingQuery.processAllAvailable(), the batches are invoked one after the other, without allowing me to verify the output of each one separately. Using trigger(Trigger.Once()) doesn't help either, because it executes one batch and I can't continue to the next one.
Is there any way to do what I want?
Currently this is my basic code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
.format("memory")
.queryName("test_output")
val streamingQuery = dsw
.start()
streamingQuery.processAllAvailable()
What I've ended up doing is setting up the test with a DataStreamWriter which runs once but saves the current state to a checkpoint. So each time we invoke dsw.start(), the new batch resumes from the latest offset, according to the checkpoint. I'm also saving the data into a globalTempView, so I can query the data in a similar way to using the memory sink. For doing that, I'm using foreachBatch (which is only available since Spark 2.4).
This is in code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw = getNewDataStreamWriter(dataFrame)
testFirstBatch(dsw)
testSecondBatch(dsw)
private def getNewDataStreamWriter(dataFrame: DataFrame) = {
val checkpointTempDir = Files.createTempDirectory("tests").toAbsolutePath.toString
val dsw: DataStreamWriter[Row] = dataFrame.writeStream
.trigger(Trigger.Once())
.option("checkpointLocation", checkpointTempDir)
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.createOrReplaceGlobalTempView("input_data")
}
dsw
}
And the actual test code for each batch (e.g. testFirstBatch) is:
val rows = processNextBatch(dsw)
assertResult(10)(rows.length)
private def processNextBatch(dsw: DataStreamWriter[Row]) = {
val streamingQuery = dsw
.start()
streamingQuery.processAllAvailable()
sparkSession.sql("select * from global_temp.input_data").collect()
}

How to write DataFrame (built from RDD inside foreach) to Kafka?

I'm trying to write a DataFrame from Spark to Kafka and I couldn't find any solution out there. Can you please show me how to do that?
Here is my current code:
activityStream.foreachRDD { rdd =>
val activityDF = rdd
.toDF()
.selectExpr(
"timestamp_hour", "referrer", "action",
"prevPage", "page", "visitor", "product", "inputProps.topic as topic")
val producerRecord = new ProducerRecord(topicc, activityDF)
kafkaProducer.send(producerRecord) // <--- this shows an error
}
type mismatch; found : org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.DataFrame] (which expands to) org.apache.kafka.clients.producer.ProducerRecord[Nothing,org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] required: org.apache.kafka.clients.producer.ProducerRecord[Nothing,String] Error occurred in an application involving default arguments.
Do collect on the activityDF to get the records (not Dataset[Row]) and save them to Kafka.
Note that you'll end up with a collection of records after collect so you probably have to iterate over it, e.g.
val activities = activityDF.collect()
// the following is pure Scala and has nothing to do with Spark
activities.foreach { a: Row =>
val pr: ProducerRecord = // map a to pr
kafkaProducer.send(pr)
}
Use pattern matching on Row to destructure it to fields/columns, e.g.
activities.foreach { case Row(timestamp_hour, referrer, action, prevPage, page, visitor, product, topic) =>
// ...transform a to ProducerRecord
kafkaProducer.send(pr)
}
PROTIP: I'd strongly suggest using a case class and transform DataFrame (= Dataset[Row]) to Dataset[YourCaseClass].
See Spark SQL's Row and Kafka's ProducerRecord docs.
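For illustration, a hedged sketch of that case-class approach: the Activity fields mirror the selectExpr columns above and are an assumption about your schema, and a.toString is only a placeholder for whatever value serialization you want.
case class Activity(timestamp_hour: String, referrer: String, action: String,
                    prevPage: String, page: String, visitor: String,
                    product: String, topic: String)

import spark.implicits._
val activities = activityDF.as[Activity].collect()
activities.foreach { a =>
  val pr = new ProducerRecord[String, String](a.topic, a.visitor, a.toString)
  kafkaProducer.send(pr)
}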
As Joe Nate pointed out in the comments:
If you do "collect" before writing to any endpoint, it's going to make all the data aggregate at the driver and then make the driver write it out. (1) This can crash the driver if there is too much data, and (2) there is no parallelism in the write.
That's 100% correct. I wish I had said it :)
You may want to use the approach as described in Writing Stream Output to Kafka instead.
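With that approach you skip collect entirely and let Spark's built-in Kafka sink (available for batch writes since Spark 2.2) write in parallel; the data source expects a string or binary value column, and optionally key and topic columns. A rough sketch, with the broker address as a placeholder:
activityDF
  .selectExpr("to_json(struct(*)) AS value", "topic")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder address
  .save()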
