How to refresh a dataframe in spark structured streaming [duplicate] - apache-spark

In my Spark Streaming application, I want to map a value based on a dictionary that's retrieved from a backend (ElasticSearch). I want to refresh the dictionary periodically, in case it was updated in the backend. It would be similar to the periodic refresh capability of Logstash's translate filter. How could I achieve this with Spark (e.g. somehow unpersist the RDD every 30 seconds)?

The best way I've found to do that is to recreate the RDD and maintain a mutable reference to it. Spark Streaming is at its core a scheduling framework on top of Spark, so we can piggy-back on the scheduler to have the RDD refreshed periodically. For that, we use an empty DStream that we schedule only for the refresh operation:
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.ConstantInputDStream
def getData(): RDD[Data] = ??? // function to create the RDD we want to use as reference data
val dstream = ??? // our data stream
// a dstream of empty data
val refreshDstream = new ConstantInputDStream(ssc, sparkContext.parallelize(Seq())).window(Seconds(refreshInterval), Seconds(refreshInterval))
var referenceData = getData()
referenceData.cache()
refreshDstream.foreachRDD { _ =>
  // evict the old RDD from memory and recreate it
  referenceData.unpersist(true)
  referenceData = getData()
  referenceData.cache()
}
val myBusinessData = dstream.transform(rdd => rdd.join(referenceData))
// ... etc ...
In the past, I also tried simply interleaving cache() and unpersist(), without success (the data refreshed only once). Recreating the RDD removes all lineage and provides a clean load of the new data.

Steps:
1. Cache once before starting the stream.
2. Clear the cache after a certain period (30 minutes in this example).
3. Optional: a Hive table repair via Spark can be added to init:
spark.sql("msck repair table tableName")
import java.time.LocalDateTime
var caching_data = Data.init()
caching_data.persist()
var currTime = LocalDateTime.now()
var cacheClearTime = currTime.plusMinutes(30) // refresh interval; adjust the amount and unit as needed
// DStream here stands for your stream instance
DStream.foreachRDD(rdd => if (rdd.take(1).length > 0) {
  // Clear and cache again once the refresh time has passed
  currTime = LocalDateTime.now()
  val refreshDue = cacheClearTime.isBefore(currTime)
  if (refreshDue) {
    caching_data.unpersist(true) // blocking = true: wait until the cached blocks are removed
    caching_data = Data.init()
    caching_data.persist()
    currTime = LocalDateTime.now()
    cacheClearTime = currTime.plusMinutes(30)
  }
})
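The time-based steps above can be factored into a small helper. The following is a minimal sketch in plain Scala, not a Spark API: `TimedRefresh`, `load`, `release`, `ttl` and `now` are hypothetical names introduced for illustration. `load` stands in for `Data.init()` followed by `persist()`, and `release` stands in for `unpersist(true)`.

```scala
import java.time.{Duration, Instant}

// Hypothetical helper (not part of Spark): holds a value and reloads it
// once it is older than `ttl`. The clock `now` is injectable for testing.
final class TimedRefresh[A](
    load: () => A,
    release: A => Unit,
    ttl: Duration,
    now: () => Instant = () => Instant.now()) {

  private var value: A = load()
  private var expiresAt: Instant = now().plus(ttl)

  // Returns the current value, reloading it first if the TTL has elapsed.
  def get(): A = synchronized {
    val t = now()
    if (t.isAfter(expiresAt)) {
      release(value)          // drop the stale copy
      value = load()          // build a fresh one
      expiresAt = t.plus(ttl) // schedule the next refresh
    }
    value
  }
}
```

Under these assumptions, one would call `get()` once per micro-batch at the top of `foreachRDD`, which runs on the driver, and use the returned reference for the join.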

Related

How to use spark to write to HBase using multi-thread

I'm using Spark to write data to HBase, but during the write stage only one executor and one core are doing the work.
Is something wrong with my code, or what should I do to make the write faster?
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)
def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // Write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        var value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You may not be able to control how many executors write to HBase in parallel.
However, you can start multiple Spark jobs from a multithreaded client program.
For example, a shell script can trigger several spark-submit commands to induce parallelism. Each Spark job works on its own set of data, independently of the others, and pushes it into HBase.
This can also be done with the SparkLauncher API (Java/Scala) combined with the Java concurrency API (e.g. the Executor framework).
import org.apache.spark.launcher.SparkLauncher
val sparkLauncher = new SparkLauncher
// Set Spark properties. Only the basic ones are shown here. They will be overridden if properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")
// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()
// Get the application ID (may be null until the application has been submitted)
val jobAppId = sparkLauncher1.getAppId
// Poll the status of the launched job (e.g. SUBMITTED, RUNNING) until it reaches a final state
while (!sparkLauncher1.getState.isFinal) {
  println(sparkLauncher1.getState.toString)
  Thread.sleep(1000)
}
However, the challenge is to track each job for failure and to recover automatically. This is especially tricky when a job fails partway through its assigned data and part of it has already been written to HBase. You may have to clean that data out of HBase automatically before retriggering the job.
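The combination of a thread pool, failure tracking and cleanup before a retry can be sketched as follows. This is only an illustration in plain Scala: `ParallelJobs`, `Outcome`, `runJob` and `cleanup` are hypothetical names, where `runJob` stands in for launching one Spark application and waiting for its final state, and `cleanup` stands in for removing partially written rows from HBase.

```scala
import java.util.concurrent.{Callable, Executors}

object ParallelJobs {
  // Result of one job after all of its attempts.
  final case class Outcome(name: String, succeeded: Boolean, attempts: Int)

  // Runs each named job on a fixed-size thread pool, retrying a failed job
  // up to `maxAttempts` times. `runJob` and `cleanup` are hypothetical hooks.
  def runAll(jobs: Seq[String], parallelism: Int, maxAttempts: Int)(
      runJob: String => Boolean,
      cleanup: String => Unit): Seq[Outcome] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    try {
      // Submit everything first so the jobs actually run concurrently.
      val futures = jobs.toList.map { name =>
        pool.submit(new Callable[Outcome] {
          override def call(): Outcome = {
            var attempt = 0
            var ok = false
            while (!ok && attempt < maxAttempts) {
              if (attempt > 0) cleanup(name) // undo partial writes of the failed attempt
              attempt += 1
              ok = try runJob(name) catch { case _: Exception => false }
            }
            Outcome(name, ok, attempt)
          }
        })
      }
      futures.map(_.get()) // wait for every job, keeping the input order
    } finally {
      pool.shutdown()
    }
  }
}
```

Whether a retry is safe depends on the cleanup being complete, which this sketch simply assumes.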

Spark RDD do not get processed in multiple nodes

I have a use case in which I create an RDD from a Hive table. I wrote business logic that operates on every row of the table. My assumption was that when I create the RDD and run a map over it, the work is spread across all my Spark executors. But what I see in my logs is that only one node processes the RDD while the other 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
  // do some computation on each row
}
Any clue where I am going wrong?
As noted by jaceklaskowski:
"By default, a partition is created for each HDFS partition, which by default is 64MB (from Spark's Programming Guide)."
If your input data is smaller than 64MB (and you are using HDFS), then by default only one partition will be created, so only one task (on one node) processes it.
With larger inputs there are more partitions, and Spark can spread the work across all nodes.
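As a rough worked example of that rule, the number of input partitions can be approximated as the input size divided by the block size, rounded up. This is only a sketch: `PartitionEstimate` and `estimate` are hypothetical names, and the estimate ignores the number of files, the file format, compression and any minimum-partition setting. The block size is configurable; 64MB is the figure from the quote.

```scala
object PartitionEstimate {
  // Rough estimate only: one partition per block, and never fewer than one.
  def estimate(inputBytes: Long, blockBytes: Long): Long =
    math.max(1L, (inputBytes + blockBytes - 1) / blockBytes)
}
```

With a 64MB block size, a 40MB table maps to a single partition, while a 200MB table maps to four.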
Is it possible that your data is skewed?
To rule this out, repartition the data and rerun the code:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
  // do some computation on each row
}
Further, if your map logic depends on a particular column, you can repartition by that column:
import org.apache.spark.sql.functions.col
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
  // do some computation on each row
}
A date column is often a good partitioning column.

Spark Streaming: How to periodically refresh cached RDD?


In spark Streaming how to reload a lookup non stream rdd after n batches

Suppose I have a streaming context that performs many steps and, at the end, each micro-batch looks up or joins against a preloaded RDD. I have to refresh that preloaded RDD every 12 hours. How can I do this? To my understanding, anything I do outside the streaming context is not replayed, so how do I get the reload called from one of the streaming RDDs? I need to make only one call, no matter how many partitions the DStream has.
This is possible by re-creating the external RDD at the time it needs to be reloaded. It requires defining a mutable variable to hold the RDD reference that's active at a given moment in time. Within the dstream.foreachRDD we can then check for the moment when the RDD reference needs to be refreshed.
This is an example of how that would look:
val stream: DStream[Int] = ??? // let's say that we have some DStream of Ints
// Some external data as an RDD of (x,x)
def externalData(): RDD[(Int, Int)] = sparkContext.textFile(dataFile)
  .flatMap { line => try { Some((line.toInt, line.toInt)) } catch { case ex: Throwable => None } }
  .cache()
// this mutable var will hold the reference to the external data RDD
var cache: RDD[(Int, Int)] = externalData()
// force materialization - useful for experimenting, not needed in reality
cache.count()
// a var to count iterations -- used to trigger the reload in this example
var tick = 1
// reload frequency
val ReloadFrequency = 5
stream.foreachRDD { rdd =>
  if (tick == 0) { // will reload the RDD every 5 iterations
    // unpersist the previous RDD, otherwise it will linger in memory, taking up resources.
    cache.unpersist(false)
    // generate a new RDD
    cache = externalData()
  }
  // join the DStream RDD with our reference data, do something with it...
  val matches = rdd.keyBy(identity).join(cache).count()
  updateData(dataFile, (matches + 1).toInt) // so I'm adding data to the static file in order to see when the new records become alive
  tick = (tick + 1) % ReloadFrequency
}
streaming.start
Before arriving at this solution, I explored toggling the persist flag on the RDD, but it didn't work as expected. It looks like unpersist() does not force re-materialization of the RDD when it's used again.
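The counter-based trigger in the example above can be checked in isolation. The following plain Scala sketch replays only the `tick` arithmetic and reports the batches at which a reload would fire; `TickReload` and `reloadBatches` are hypothetical names introduced for this illustration.

```scala
object TickReload {
  // Replays the tick logic and returns the 1-based batch numbers
  // at which the reload branch (tick == 0) would be taken.
  def reloadBatches(totalBatches: Int, reloadFrequency: Int, initialTick: Int = 1): Seq[Int] = {
    var tick = initialTick
    val fired = Seq.newBuilder[Int]
    for (batch <- 1 to totalBatches) {
      if (tick == 0) fired += batch
      tick = (tick + 1) % reloadFrequency
    }
    fired.result()
  }
}
```

Starting from `tick = 1` with a frequency of 5, the first reload happens on the fifth batch and then every five batches after that. Starting from 0 would reload on the very first batch.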

Can data be loaded in Apache Spark RDD/Dataframe on the fly?

Can data be loaded on the fly, or does it have to be pre-loaded into the RDD/DataFrame?
Say I have a SQL database and I use the JDBC source to load 1,000,000 records into the RDD. If, for example, a new record comes into the DB, can I write a job that will add that 1 new record to the RDD/DataFrame to make it 1,000,001? Or does the entire RDD/DataFrame have to be rebuilt?
I guess it depends on what you mean by add (...) record and rebuilt. It is possible to use SparkContext.union or RDD.union to merge RDDs and DataFrame.unionAll to merge DataFrames.
As long as RDDs, which are merged, use the same serializer there is no need for reserialization but, if the same partitioner is used for both, it will require repartitioning.
Using JDBC source as an example:
import org.apache.spark.sql.functions.{max, lit}
import sqlContext.implicits._ // enables the $"..." column syntax
val pMap = Map("url" -> "jdbc:..", "dbtable" -> "test")
// Load first batch
val df1 = sqlContext.load("jdbc", pMap).cache
// Get max id and trigger cache
val maxId = df1.select(max($"id")).first().getInt(0)
// Some inserts here...
// Get new records
val dfDiff = sqlContext.load("jdbc", pMap).where($"id" > lit(maxId))
// Combine - only dfDiff has to be fetched
// Should be cached as before
df1.unionAll(dfDiff)
If you need an updatable data structure, IndexedRDD implements a key-value store on top of Spark.
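The incremental pattern in the JDBC example (remember the highest id seen, fetch only newer rows, then union) can be illustrated without Spark. This is a minimal sketch over plain Scala collections: `IncrementalLoad`, `Record`, `refresh` and `fetchNewerThan` are hypothetical names, with `fetchNewerThan` standing in for the filtered JDBC read.

```scala
object IncrementalLoad {
  final case class Record(id: Int, payload: String)

  // Appends only the records whose id is greater than the largest id already loaded.
  // `fetchNewerThan` is a hypothetical stand-in for a source query such as "id > maxId".
  def refresh(loaded: Seq[Record], fetchNewerThan: Int => Seq[Record]): Seq[Record] = {
    val maxId = if (loaded.isEmpty) Int.MinValue else loaded.map(_.id).max
    loaded ++ fetchNewerThan(maxId)
  }
}
```

Like the DataFrame version, this assumes ids only grow and existing rows are never updated or deleted. Otherwise a full reload is needed.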
