How to use Spark to write to HBase using multiple threads - apache-spark

I'm using Spark to write data to HBase, but at the writing stage only one executor and one core are doing the work.
I wonder why my code is not writing in parallel, and what I should do to make it write faster.
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)

def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        val value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}

You may not be able to control how many executors write to HBase in parallel from within a single job.
However, you can start multiple Spark jobs from a multi-threaded client program.
For example, you can have a shell script that triggers multiple spark-submit commands to get parallelism. Each Spark job works on one set of data, independent of the others, and pushes it into HBase.
This can also be done with the Spark SparkLauncher API (Java/Scala), combined with the Java concurrency API (e.g. the Executor framework); a sketch of launching several jobs from a thread pool follows the launcher example below.
val sparkLauncher = new SparkLauncher()
// Set Spark properties. Only basic ones are shown here; they will be overridden
// if the same properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")

// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()

// Get the application id
val jobAppId = sparkLauncher1.getAppId

// Get the status of the launched job. This loop will continually print states such as SUBMITTED and RUNNING.
while (true) {
  println(sparkLauncher1.getState().toString)
}
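To launch several such jobs in parallel from one client JVM, the launcher can be combined with a thread pool. Below is a minimal sketch of that idea; the jar path, main class, and the "slice" argument that tells each job which portion of the data to load are placeholders of my own, not part of the original setup.

import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object ParallelHBaseLoads {
  def main(args: Array[String]): Unit = {
    // each job is responsible for one independent slice of the data (placeholder names)
    val slices = Seq("slice-1", "slice-2", "slice-3")
    val pool = Executors.newFixedThreadPool(slices.size)

    slices.foreach { slice =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          val handle: SparkAppHandle = new SparkLauncher()
            .setSparkHome("/path/to/SPARK_HOME")         // placeholder
            .setAppResource("/path/to/hbase-writer.jar") // placeholder
            .setMainClass("com.example.HBaseWriterJob")  // placeholder
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setConf("spark.executor.cores", "2")
            .addAppArgs(slice)                           // tell the job which slice to load
            .startApplication()

          // poll until the application reaches a final state (FINISHED, FAILED, KILLED)
          while (!handle.getState.isFinal) {
            println(s"$slice -> ${handle.getState}")
            TimeUnit.SECONDS.sleep(5)
          }
          println(s"$slice finished with state ${handle.getState}")
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(2, TimeUnit.HOURS)
  }
}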
However, the challenge is to track each of these jobs for failure and to recover automatically. That can be tricky, especially when partial data has already been written into HBase, i.e. a job fails after processing only part of the data assigned to it. You may have to clean that data out of HBase automatically before retriggering the job.

Related

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame, and my use case requires submitting one job (either invoking a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom formatted date column and logging the number of partitions before and after, and that part works as expected.
However, when I look at the total number of jobs submitted, it is more than the number of partitions: for some of the partitions there are two or three jobs!
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId())
    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val paritionColumnName: AtomicReference[String] = new AtomicReference[String]()
    // iterate over the objects in a given partition
    val updatedPartition = partition.map(record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.set(partitionCounter.get() + 1)
      val recordSizeInt = Integer.parseInt(record.getAs("object_size"))
      val recordSize: Long = recordSizeInt.toLong
      partitionObjectList.add(record.getAs("object_key"))
      paritionColumnName.set(record.getAs("partition_column_name"))
      record
    }).toList
    logger_ref.info("No. of elements in partition [" + paritionColumnName.get() + "] = [" + partitionCounter.get() + "], total size = [" + partitionSize.get() + "]")
    // submit a job for the partition
    // jobUtil.submitJob(paritionColumnName.get(), partitionObjectList, partitionSize.get())
    updatedPartition.toIterator
  }
)
Another thing that makes debugging harder is that the logging statements inside the mapPartitions() method do not show up in the container error logs. Since they are executed on each worker node rather than on the master node, I expected to find them in the container logs rather than in the master node logs. I also need to figure out why I only see stderr logs, but not stdout logs, on the containers.
Thanks
Sateesh

Spark Structured Streaming - testing one batch at a time

I'm trying to create a test for a custom MicroBatchReadSupport DataSource which I've implemented.
For that, I want to invoke one batch at a time, which will read the data using this DataSource (I've created appropriate mocks). I want to invoke a batch, verify that the correct data was read (currently by saving it to a memory sink and checking the output), and only then invoke the next batch and verify its output.
I couldn't find a way to invoke the batches one after another like this.
If I use streamingQuery.processAllAvailable(), the batches are invoked one after the other without allowing me to verify the output of each one separately. Using trigger(Trigger.Once()) doesn't help either, because it executes one batch and I can't continue to the next one.
Is there any way to do what I want?
Currently this is my basic code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()

val dsw: DataStreamWriter[Row] = dataFrame.writeStream
  .format("memory")
  .queryName("test_output")

val streamingQuery = dsw.start()
streamingQuery.processAllAvailable()
What I ended up doing is setting up the test with a DataStreamWriter that runs once but saves the current state to a checkpoint. Each time we invoke dsw.start(), the new batch resumes from the latest offset according to the checkpoint. I'm also saving the data into a global temp view, so I can query it in a way similar to using the memory sink. To do that, I'm using foreachBatch (which is only available since Spark 2.4).
This is in code:
val dataFrame = sparkSession.readStream.format("my-custom-data-source").load()
val dsw = getNewDataStreamWriter(dataFrame)

testFirstBatch(dsw)
testSecondBatch(dsw)

private def getNewDataStreamWriter(dataFrame: DataFrame) = {
  val checkpointTempDir = Files.createTempDirectory("tests").toAbsolutePath.toString
  val dsw: DataStreamWriter[Row] = dataFrame.writeStream
    .trigger(Trigger.Once())
    .option("checkpointLocation", checkpointTempDir)
    .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      batchDF.createOrReplaceGlobalTempView("input_data")
    }
  dsw
}
And the actual test code for each batch (e.g. testFirstBatch) is:
val rows = processNextBatch(dsw)
assertResult(10)(rows.length)

private def processNextBatch(dsw: DataStreamWriter[Row]) = {
  val streamingQuery = dsw.start()
  streamingQuery.processAllAvailable()
  sparkSession.sql("select * from global_temp.input_data").collect()
}

Spark Streaming: How to periodically refresh cached RDD?

In my Spark Streaming application, I want to map values using a dictionary that's retrieved from a backend (ElasticSearch). I want to refresh the dictionary periodically, in case it was updated in the backend, similar to the periodic refresh capability of Logstash's translate filter. How could I achieve this with Spark (e.g. somehow unpersist the RDD every 30 seconds)?
The best way I've found to do that is to recreate the RDD and maintain a mutable reference to it. Spark Streaming is, at its core, a scheduling framework on top of Spark. We can piggy-back on the scheduler to have the RDD refreshed periodically. For that, we use an empty DStream that we schedule only for the refresh operation:
// function to create the RDD we want to use as reference data
def getData(): RDD[Data] = ???

val dstream = ??? // our data stream

// a dstream of empty data, used only to drive the refresh
val refreshDstream = new ConstantInputDStream(ssc, sparkContext.parallelize(Seq()))
  .window(Seconds(refreshInterval), Seconds(refreshInterval))

var referenceData = getData()
referenceData.cache()

refreshDstream.foreachRDD { _ =>
  // evict the old RDD from memory and recreate it
  referenceData.unpersist(true)
  referenceData = getData()
  referenceData.cache()
}

val myBusinessData = dstream.transform(rdd => rdd.join(referenceData))
// ... etc ...
In the past, I also tried just interleaving cache() and unpersist(), with no result (it refreshes only once). Recreating the RDD removes all lineage and provides a clean load of the new data.
Steps:
Cache once before starting the streaming job.
Clear the cache after a certain period (the example here uses 30 minutes).
Optional: a Hive table repair via Spark can be added to the init step:
spark.sql("msck repair table tableName")
import java.time.LocalDateTime
var caching_data = Data.init()
caching_data.persist()
var currTime = LocalDateTime.now()
var cacheClearTime = currTime.plusMinutes(30) // Put your time in Units
DStream.foreachRDD(rdd => if (rdd.take(1).length > 0) {
//Clear and Cache again
currTime = LocalDateTime.now()
val dateDiff = cacheClearTime.isBefore(currTime)
if (dateDiff) {
caching_data.unpersist(true) //blocking unpersist on boolean = true
caching_data = Data.init()
caching_data.persist()
currTime = LocalDateTime.now()
cacheClearTime = currTime.plusMinutes(30)
}
})

apache spark running task on each rdd

I have an RDD which is distributed across multiple machines in a Spark environment. I would like to execute a function on each worker machine on this RDD.
I do not want to collect the RDD and then execute the function on the driver. The function should be executed separately on each executor, for its own part of the RDD.
How can I do that?
Update (adding code)
I am running all of this in the Spark shell.
import org.apache.spark.sql.cassandra.CassandraSQLContext
import java.util.Properties
val cc = new CassandraSQLContext(sc)
val rdd = cc.sql("select * from sams.events where appname = 'test'");
val df = rdd.select("appname", "assetname");
Here I have a df with 400 rows. I need to save this df to a SQL Server table. When I try to use the df.write method it gives me errors, which I have posted in a separate thread:
spark dataframe not appending to the table
I can open a DriverManager connection and insert rows, but that would be done in the driver module of Spark.
import java.sql._
import com.microsoft.sqlserver.jdbc.SQLServerDriver

val connectionUrl = "jdbc:sqlserver://localhost:1433;" +
  "databaseName=AdventureWorks;user=MyUserName;password=*****;"
val con = DriverManager.getConnection(connectionUrl)

// create a Statement from the connection
val statement = con.createStatement()

// insert the data
statement.executeUpdate("INSERT INTO Customers " + "VALUES (1001, 'Simpson', 'Mr.', 'Springfield', 2001)")
I need to do this writing on the executor machines. How can I achieve this?
To set up connections from the workers to other systems, we should use rdd.foreachPartition(iter => ...).
foreachPartition lets you execute an operation for each partition, giving you access to the data of the partition as a local iterator.
With enough data per partition, the cost of setting up resources (like DB connections) is amortized by reusing those resources across the whole partition.
An abstract example:
rdd.foreachPartition { iter =>
  // set up the db connection (pseudocode placeholder)
  val dbconn = Driver.connect(ip, port)
  iter.foreach { element =>
    val query = makeQuery(element)
    dbconn.execute(query)
  }
  dbconn.close()
}
It's also possible to create singleton resource managers that manage those resources for each JVM in the cluster. See this answer for a complete example of such a local resource manager: spark-streaming and connection pool implementation
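For the SQL Server case in the question, a minimal concrete sketch of this pattern could look like the following. The target table EventsCopy and its two columns are assumptions made for illustration only; for larger partitions you would likely batch the inserts (addBatch/executeBatch) instead of executing them one by one.

import java.sql.DriverManager

val connectionUrl =
  "jdbc:sqlserver://localhost:1433;databaseName=AdventureWorks;user=MyUserName;password=*****;"

df.rdd.foreachPartition { rows =>
  // one connection per partition, opened on the executor that owns the partition
  val conn = DriverManager.getConnection(connectionUrl)
  // EventsCopy and its columns are hypothetical; adjust to your schema
  val stmt = conn.prepareStatement("INSERT INTO EventsCopy (appname, assetname) VALUES (?, ?)")
  try {
    rows.foreach { row =>
      stmt.setString(1, row.getAs[String]("appname"))
      stmt.setString(2, row.getAs[String]("assetname"))
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}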
