How to make Task Serializable in HBase using Spark - apache-spark

I was trying to write data to HBase using Spark, but I am getting the exception Exception in thread "main" org.apache.spark.SparkException: Task not serializable. I was trying to open a connection on each worker node using the following code snippet:
val conf = HBaseConfiguration.create()
val tableName = args(1)
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(tableName)
val columnDesc = new HColumnDescriptor("cf".getBytes()).setBloomFilterType(BloomType.ROWCOL).setMaxVersions(5)
tableDesc.addFamily(columnDesc)
admin.createTable(tableDesc)
rddData.foreachPartition( part => {
  val table = new HTable(conf, tableName)
  part.foreach( elem => {
    var put = new Put(Bytes.toBytes(elem._1))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
    table.put(put)
  })
  table.flushCommits()
})
How can I make the task serializable when writing to HBase from Spark?

If I am not mistaken, conf (an instance of Hadoop's Configuration) is not serializable.
Write your code so that all the non-serializable parts are inside the foreachPartition block (so that they are executed on the worker nodes). Here is an example where I create a second conf, etc.:
rddData.foreachPartition( part => {
  val conf2 = HBaseConfiguration.create()
  val tableName2 = args(1)
  conf2.set(TableInputFormat.INPUT_TABLE, tableName2)
  val table2 = new HTable(conf2, tableName2)
  part.foreach( elem => {
    var put = new Put(Bytes.toBytes(elem._1))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
    table2.put(put)
  })
  table2.flushCommits()
})
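If you are on a newer HBase client (1.0+), HTable and flushCommits are deprecated. A minimal sketch of the same per-partition write using ConnectionFactory and BufferedMutator, assuming only the table and column family names from the question, would look roughly like this:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

rddData.foreachPartition { part =>
  // Everything non-serializable is built here, on the executor, once per partition
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  val mutator = connection.getBufferedMutator(TableName.valueOf(tableName))
  part.foreach { elem =>
    val put = new Put(Bytes.toBytes(elem._1))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(elem._2))
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
  connection.close()
}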

Related

Spark Cassandra connection through Java client

I want to connect to my Scylla DB/Cassandra from a Spark job and execute a lookup query using the Java client. I tried the following:
val spark = SparkSession.builder.appName("ScyllaSparkClient")
  .master("local[1]")
  .getOrCreate()

import spark.implicits._
val m = Map( "John" -> 2 )
val df = m.toSeq.toDF("first", "id")
df.show

val vdf = df.mapPartitions(p => {
  val cluster = Cluster.builder.addContactPoints("127.0.0.1").build
  val session = cluster.connect("MyKeySpace")
  val res = p.map(record => {
    val results = session.execute(s"SELECT * FROM MyKeySpace.MyColumns where id='${record.get(1)}' and first='${record.get(0)}'")
    val row = results.one()
    var scyllaRow: Person = null
    if (row != null) {
      scyllaRow = Person(row.getString("id").toInt, row.getString("first"), row.getString("last"))
    }
    scyllaRow
  })
  session.close()
  cluster.close()
  res
})
vdf.show()
But I come across a NoHostAvailableException (though there are no connectivity issues; it works fine with the plain Java client):
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
at com.datastax.driver.core.RequestHandler.reportNoMoreHosts(RequestHandler.java:210)
at com.datastax.driver.core.RequestHandler.access$1000(RequestHandler.java:46)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.findNextHostAndQuery(RequestHandler.java:274)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:114)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:94)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
... 27 more
Any help is appreciated.
You need to use the Spark Cassandra connector to connect to a Cassandra database from Spark.
The connector is available from here -- https://github.com/datastax/spark-cassandra-connector. But since you're connecting to a Scylla DB, you'll likely need to use Scylla's fork of the connector. Cheers!
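For example, with sbt the dependency would look something like the line below; the exact artifact and version have to match your Spark and Scala versions, so treat these coordinates as a placeholder:
// build.sbt -- version is illustrative only
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.5.2"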
Use CassandraConnector from com.datastax.spark.connector.cql.
It will take care of session management for each partition.
def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder.appName("ScyllaSparkClient")
    .config("spark.cassandra.connection.host", "localhost")
    .master("local[1]")
    .getOrCreate()

  import spark.implicits._
  val m = Map( "John" -> 2 )
  val df = m.toSeq.toDF("first", "id")
  df.show

  val connector = CassandraConnector(spark.sparkContext.getConf)
  val vdf = df.mapPartitions(p => {
    connector.withSessionDo { session =>
      val res = p.map(record => {
        val results = session.execute(s"SELECT * FROM MyKeySpace.MyColumns where id='${record.get(1)}' and first='${record.get(0)}'")
        val row = results.one()
        var scyllaRow: Person = null
        if (row != null) {
          scyllaRow = Person(row.getString("id").toInt, row.getString("first"), row.getString("last"))
        }
        scyllaRow
      })
      res
    }
  })
  vdf.show()
}
It will work!
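If the lookup is executed for many rows, it is usually worth preparing the statement once per partition instead of interpolating values into a new query string for every record. A rough sketch, reusing the keyspace, table, and column names from the question:
val vdf = df.mapPartitions(p => {
  connector.withSessionDo { session =>
    // Prepared once per partition, bound once per record
    val stmt = session.prepare("SELECT * FROM MyKeySpace.MyColumns WHERE id = ? AND first = ?")
    p.map { record =>
      val row = session.execute(stmt.bind(record.get(1).toString, record.get(0).toString)).one()
      if (row != null) Person(row.getString("id").toInt, row.getString("first"), row.getString("last"))
      else null
    }
  }
})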

Spark Streaming HBase error

I want to insert streaming data into HBase.
This is my code:
val tableName = "streamingz"
val conf = HBaseConfiguration.create()
conf.addResource(new Path("file:///opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/etc/hbase/conf.dist/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val admin = new HBaseAdmin(conf)
if (!admin.isTableAvailable(tableName)) {
  print("-----------------------------------------------------------------------------------------------------------")
  val tableDesc = new HTableDescriptor(tableName)
  tableDesc.addFamily(new HColumnDescriptor("z1".getBytes()))
  tableDesc.addFamily(new HColumnDescriptor("z2".getBytes()))
  admin.createTable(tableDesc)
} else {
  print("Table already exists!!--------------------------------------------------------------------------------------")
}
val ssc = new StreamingContext(sc, Seconds(10))
val topicSet = Set("fluxAstellia")
val kafkaParams = Map[String, String]("metadata.broker.list" -> "10.32.201.90:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val lines = stream.map(_._2).map(_.split(" ", -1)).foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    val myTable = new HTable(conf, tableName)
    rdd.map(rec => {
      var put = new Put(rec._1.getBytes)
      put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec._2))
      myTable.put(put)
    }).saveAsNewAPIHadoopDataset(conf)
    myTable.flushCommits()
  } else {
    println("rdd is empty")
  }
})
ssc.start()
ssc.awaitTermination()
}
}
I got this error:
:66: error: value _1 is not a member of Array[String]
var put = new Put(rec._1.getBytes)
I'm a beginner, so how can I fix this error? And I have a question:
where exactly should I create the table, outside the streaming process or inside it?
Thank you
Your error is basically on the line var put = new Put(rec._1.getBytes).
You can only call _n on a tuple (and therefore on a Map entry, where _1 is the key and _2 the value), not on an array.
rec is an Array[String] that you got by splitting the string from the stream on spaces. If you want the first element, write it as var put = new Put(rec(0).getBytes). Likewise, in the next line write put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec(1)))
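Putting that together, the body of the foreachRDD would look roughly like the sketch below. Note that this only fixes the indexing; writing the puts from a side-effecting rdd.map and then calling saveAsNewAPIHadoopDataset on it has its own problems, so a foreachPartition (as in the first question above) is used instead:
rdd.foreachPartition { part =>
  // Configuration is not serializable, so re-create it here on the executor
  // (add hbase-site.xml as in the question if it is not on the executor classpath)
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, tableName)
  val myTable = new HTable(conf, tableName)
  part.foreach { rec =>
    val put = new Put(rec(0).getBytes)   // rec is an Array[String], so index it with ()
    put.add("z1".getBytes(), "name".getBytes(), Bytes.toBytes(rec(1)))
    myTable.put(put)
  }
  myTable.flushCommits()
}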

How to perform multithreading or parallel processing in Spark implemented in Scala

Hi, I have a Spark Streaming program which reads events from Event Hubs and pushes them to topics. Processing each batch takes almost 10 times the batch interval.
When I try to implement multithreading, I am not able to see much difference in the processing time.
Is there any way I can increase the performance, either through parallel processing or by starting some 1000 threads at a time and just keep pushing the messages?
class ThreadExample(msg: String) extends Thread {
  override def run {
    var test = new PushToTopicDriver(msg)
    test.push()
    // println(msg)
  }
}

object HiveEventsDirectStream {
  def b2s(a: Array[Byte]): String = new String(a)

  def main(args: Array[String]): Unit = {
    val eventhubnamespace = "namespace"
    val progressdir = "/Event/DirectStream/"
    val eventhubname_d = "namespacestream"
    val ehParams = Map[String, String](
      "eventhubs.policyname" -> "PolicyKeyName",
      "eventhubs.policykey" -> "key",
      "eventhubs.namespace" -> "namespace",
      "eventhubs.name" -> "namespacestream",
      "eventhubs.partition.count" -> "30",
      "eventhubs.consumergroup" -> "$default",
      "eventhubs.checkpoint.dir" -> "/EventCheckpoint_0.1",
      "eventhubs.checkpoint.interval" -> "2"
    )
    println("testing spark")
    val conf = new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").setMaster("local[4]").setAppName("Eventhubs_Test")
    conf.registerKryoClasses(Array(classOf[PublishToTopic]))
    conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val pool: ExecutorService = Executors.newFixedThreadPool(30)
    val ssc = new StreamingContext(sc, Seconds(2))
    var dataString: RDD[String] = sc.emptyRDD
    val stream = EventHubsUtils.createDirectStreams(ssc, eventhubnamespace, progressdir, Map(eventhubname_d -> ehParams))
    val kv1 = stream.map(receivedRecord => (new String(receivedRecord.getBody))).persist()
    kv1.foreachRDD(rdd_1 => rdd_1.foreachPartition(line => line.foreach(msg => { var t1 = new ThreadExample(msg); t1.start() })))
    ssc.start()
    ssc.awaitTermination()
  }
}
Thanks,
Ankush Reddy.

Convert an RDD into a DataFrame after a foreachRDD operation

I am processing logs using Spark Streaming. I parse the logs and convert them into a Java Map. Following is the code.
Now I want to convert this Map into DataFrames.
Any suggestions on how to achieve this?
val sparkConf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")
sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
sqlContext = new SQLContext(sc)
val lines = ssc.textFileStream("hdfs://localhost:9000/test")
process(lines)

def process(lines: DStream[String]) {
  val maptorow = lines.foreachRDD(rdd => {
    rdd.map(line => getMap(line))
      .map(p =>
        Row(p.get("column1"),
          p.get("column2")))
  }) // how to get dataframe after this?

  def getMap(logs: String): java.util.Map[String, Object] = {
    val k: java.util.Map[String, String] = parseLog(logs)
  }
}
Thanks
foreachRDD has no return type, so you shouldn't be saving maptorow. To convert it, you need to do the conversion inside the foreachRDD and then deal with each RDD by itself as a separate set of data:
val sqlContext = new SQLContext(sparkContext)
lines.foreachRDD(rdd => {
  import sqlContext.implicits._
  // Convert each parsed map to a tuple of the columns you need, then to a DataFrame.
  // (toDF needs tuples or case classes; an RDD[Row] would need an explicit schema instead.)
  val newRDD = rdd.map(line => getMap(line))
    .map(p => (String.valueOf(p.get("column1")), String.valueOf(p.get("column2"))))
  val myDataFrame = newRDD.toDF("column1", "column2")
  // process myDataFrame as a DF
})

How can I use the Spark SQLContext object in a Spark SQL registered function?

I am new to Spark SQL. The concat function was not available in Spark SQL queries, so we registered a SQL function; inside this function I need to access another table, and for that we wrote a Spark SQL query on the SQLContext object.
When I invoke this query I get a NullPointerException. Can you please help with this?
Thanks in advance
// This is my code
class SalesHistory_2(sqlContext: SQLContext, sparkContext: SparkContext) extends Serializable {

  import sqlContext._
  import sqlContext.createSchemaRDD

  try {
    sqlContext.registerFunction("MaterialTransformation", Material_Transformation _)

    def Material_Transformation(Material_ID: String): String = {
      var material: String = null
      var dd = sqlContext.sql("select * from product_master")
      material
    }

    /* Product master */
    val productRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_PRODUCT_MASTER.txt")
    val product_schemaString = productRDD.first
    val product_withoutHeaders = dropHeader(productRDD)
    val product_schema = StructType(product_schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
    val productdata = product_withoutHeaders.map { _.replace("|", "| ") }.map(x => x.split("\\|"))
    var product_rowRDD = productdata.map(line => {
      Row.fromSeq(line.map { _.trim() })
    })
    val product_srctableRDD = sqlContext.applySchema(product_rowRDD, product_schema)
    product_srctableRDD.registerTempTable("product_master")
    cacheTable("product_master")

    /* Customer master */

    /* Sales History */
    val srcRDD = this.sparkContext.textFile("D:\\Realease 8.0\\files\\BHI\\BHI_SOP_TRADE_SALES_HISTORY_DS_4_20150119.txt")
    val schemaString = srcRDD.first
    val withoutHeaders = dropHeader(srcRDD)
    val schema = StructType(schemaString.split("\\|").map(fieldName => StructField(fieldName, StringType, true)))
    val lines = withoutHeaders.map { _.replace("|", "| ") }.map(x => x.split("\\|"))
    var rowRDD = lines.map(line => {
      Row.fromSeq(line.map { _.trim() })
    })
    val srctableRDD = sqlContext.applySchema(rowRDD, schema)
    srctableRDD.registerTempTable("SALES_HISTORY")

    val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")

    val path: Path = Path("D:/Realease 8.0/files/output/")
    try {
      path.deleteRecursively(continueOnFailure = false)
    } catch {
      case e: IOException => // some file could not be deleted
    }
    val successRDDToFile = srcResults.map { x => x.mkString("|") }
    successRDDToFile.coalesce(1).saveAsTextFile("D:/Realease 8.0/files/output/")
  } catch {
    case ex: Exception => println(ex) // TODO: handle error
  }

  this.sparkContext.stop()

  def dropHeader(data: RDD[String]): RDD[String] = {
    data.mapPartitionsWithIndex((idx, lines) => {
      if (idx == 0) {
        lines.drop(1)
      }
      lines
    })
  }
}
The answer here is rather short and probably disappointing - you simply cannot do something like this.
The general rule in Spark is that you cannot trigger an action or transformation from another action or transformation or, to be a little more precise, that outside the driver the SparkContext is no longer accessible / defined.
Calling Spark SQL for each row in the Sales History RDD looks like a very bad idea:
val srcResults = sqlContext.sql("SELECT Delivery_Number,Delivery_Line_Item,MaterialTransformation(Material_ID),Customer_Group_Node,Ops_ID,DC_ID,Mfg_ID,PGI_Date,Delivery_Qty,Customer_Group_Node,Line_Total_COGS,Line_Net_Rev,Material_Description,Sold_To_Partner_Name,Plant_Description,Originating_Doc,Orig_Doc_Line_item,Revenue_Type,Material_Doc_Ref,Mater_Doc_Ref_Item,Req_Delivery_Date FROM SALES_HISTORY")
You'd be better off using a join between your RDDs and forgetting your custom function:
val srcResults = sqlContext.sql("SELECT s.*, p.* FROM SALES_HISTORY s join product_master p on s.Material_ID=p.ID")
