How to write to HBase from Spark Streaming - apache-spark

Referring to my other question, "Writing to HBase from Spark Streaming", I was advised to follow https://www.mapr.com/blog/spark-streaming-hbase in order to write to HBase from Spark Streaming, and that's what I did (with modifications according to my needs). When running spark-submit there is no error, but no data is written into HBase either. I'll show you the code; could you please figure out whether I did something wrong and how to correct it?
val conf = HBaseConfiguration.create()
val jobConfig: JobConf = new JobConf(conf)
jobConfig.setOutputFormat(classOf[TableOutputFormat])
jobConfig.set(TableOutputFormat.OUTPUT_TABLE, "tabName")
Dstream.foreachRDD(rdd =>
  rdd.map(Convert.toPut).saveAsHadoopDataset(jobConfig))
with:
object Convert {
  def toPut(parametre: (String, String)): (ImmutableBytesWritable, Put) = {
    val put = new Put(Bytes.toBytes(1))
    put.add(Bytes.toBytes("colfamily"), Bytes.toBytes(parametre._1), Bytes.toBytes(parametre._2))
    (new ImmutableBytesWritable(Bytes.toBytes(1)), put)
  }
}
Could you please help me find out what I'm doing wrong?
Thank you in advance
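For reference, a minimal sketch of the conversion function following the MapR blog pattern, with the row key derived from the record instead of the constant 1. The key choice is an assumption for illustration, not a verified fix for the issue above:
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

object Convert {
  def toPut(parametre: (String, String)): (ImmutableBytesWritable, Put) = {
    // assumption: use the first element of the tuple as the row key,
    // so different records do not keep overwriting the same row
    val rowKey = Bytes.toBytes(parametre._1)
    val put = new Put(rowKey)
    put.add(Bytes.toBytes("colfamily"), Bytes.toBytes(parametre._1), Bytes.toBytes(parametre._2))
    (new ImmutableBytesWritable(rowKey), put)
  }
}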

Related

Spark - SparkSession access issue

I have a problem similar to the one in
Spark java.lang.NullPointerException Error when filter spark data frame on inside foreach iterator
String_Lines.foreachRDD { line ->
    line.foreach { x ->
        // JSON to DF Example
        val sparkConfig = SparkConf().setAppName("JavaKinesisWordCountASL").setMaster("local[*]")
            .set("spark.sql.warehouse.dir", "file:///C:/tmp")
        val spark = SparkSession.builder().config(sparkConfig).orCreate
        val outer_jsonData = Arrays.asList(x)
        val outer_anotherPeopleDataset = spark.createDataset(outer_jsonData, Encoders.STRING())
        spark.read().json(outer_anotherPeopleDataset).createOrReplaceTempView("jsonInnerView")
        spark.sql("select name, address.city, address.state from jsonInnerView").show(false)
        println("Current String #" + x)
    }
}
@thebluephantom did explain it to the point. I have my code in foreachRDD now, but it still doesn't work. This is Kotlin and I am running it locally on my laptop with IntelliJ. Somehow it's not picking up the SparkSession, as I understand after reading all the blogs. If I delete "spark.read and spark.sql", everything else works OK. What should I do to fix this?
If I delete "spark.read and spark.sql", everything else works OK
If you delete those, you're not actually making Spark do anything; the rest only defines the work that should happen, and Spark evaluates it lazily until an action (such as show()) runs.
Somehow it's not picking up the SparkSession, as I understand
It's "picking it up" just fine. The error is happening because it's picking up a brand new SparkSession. You should already have defined one of these outside of the forEachRDD method, but if you try to reuse it, you might run into different issues
Assuming String_Lines is already a DataFrame, there's no point in looping over all of its RDD data and trying to create a brand-new SparkSession. (If it's a DStream, convert it to a streaming DataFrame instead.)
That being said, you should be able to select data from it directly:
// unclear what the schema of this is
val selected = String_Lines.selectExpr("name", "address.city", "address.state")
selected.show(false)
You may need to add a get_json_object call in there if you're trying to parse JSON strings.
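For instance, a minimal sketch, assuming String_Lines is a DataFrame with the raw JSON in a string column named value (a placeholder name, not from the original post):
import org.apache.spark.sql.functions.{col, get_json_object}

// pull individual fields out of the JSON string column
val parsed = String_Lines.select(
  get_json_object(col("value"), "$.name").alias("name"),
  get_json_object(col("value"), "$.address.city").alias("city"),
  get_json_object(col("value"), "$.address.state").alias("state"))

parsed.show(false)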
I was able to solve it finally.
I modified the code like this; it's clean and working.
This is the String_Lines data type:
val String_Lines: JavaDStream<String>
String_Lines.foreachRDD { x ->
    val df = spark.read().json(x)
    df.printSchema()
    df.show(2, false)
}
Thanks,
Chandra

How to create dataframe inside ForeachWriter[Row]

I have a streaming query that reads from Kafka as the source. I want to perform some logic on each batch that I receive from the stream. Here's how I have done it so far:
val streamDF = spark
  .readStream
  ...
  .load()

//val bc = spark.sparkContext.broadcast(spark)

streamDF
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true
    def process(record: Row): Unit = {
      val aRDD = spark.sparkContext.parallelize(Seq('a', 'b', 'C'))
      val aDF = spark.createDataFrame(aRDD)
      //val aDF = bc.value.createDataFrame(aRDD)
      // do something with aDF
    }
    def close(errorOrNull: Throwable): Unit = {}
  })
  .start()
I'm using Spark 2.3.2, so I'm stuck with ForeachWriter (I cannot use foreachBatch, which would've made my life simpler). I'm also aware that foreach() runs on the executors.
So, keeping that in mind, I broadcast the SparkSession to all the executors, but that did not help either. That's the commented part of the code snippet.
I'm looking for a way to process data as a dataframe inside foreach in Spark 2.3.2 (I have to use dataframes/datasets, as the operations are pretty heavy and include actions as well).
I found a similar question but there is no response on it --> similar q
Sorry, but no: it is not possible to create a dataframe on an executor.
A dataframe is a distributed collection in Spark; it can only be created on the driver node, or derived from other dataframes via transformations (and materialized by actions) in your Spark app.
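As a rough illustration of the driver-side alternative: instead of building dataframes inside ForeachWriter, declare the per-batch logic as transformations on the streaming dataframe itself. A minimal sketch, assuming streamDF comes from the Kafka source (so it exposes a binary value column), with someField as a hypothetical field name:
import org.apache.spark.sql.functions.{col, get_json_object}

// transformations are declared once on the driver; executors only run the resulting tasks
val transformed = streamDF
  .selectExpr("CAST(value AS STRING) AS json")
  .select(get_json_object(col("json"), "$.someField").alias("someField"))

transformed
  .writeStream
  .format("console")  // stand-in sink for the sketch
  .start()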

Spark dataframe returning only structure when connected to Phoenix query server

I am connecting to HBase (ver 1.2) via the Phoenix (4.11) query server from Spark 2.2.0, but the dataframe returns only the table structure with empty rows, though data is present in the table.
Here is the code I am using to connect to the query server:
// --jars phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, the better approach is to load the table directly via Phoenix as a dataframe. That way you only provide the ZooKeeper URL, and you can pass a predicate so that you load only the data you need rather than the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()

val df = spark.sqlContext.phoenixTableAsDataFrame(
  "TABLE1",
  Seq("COL1", "COL2"),
  predicate = Some("\"COL1\" = 1"),
  conf = configuration)
Read this for more info on getting a table as an RDD and on saving dataframes and RDDs.
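As a rough sketch of those two variants, based on my understanding of the phoenix-spark module (the output table name and ZooKeeper URL are placeholders, so double-check against your Phoenix version):
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._

// reading the same table as an RDD of column-name -> value maps, with the same predicate pushdown
val rdd = spark.sparkContext.phoenixTableAsRDD(
  "TABLE1",
  Seq("COL1", "COL2"),
  predicate = Some("\"COL1\" = 1"),
  conf = configuration)

// writing a dataframe back through the phoenix-spark data source
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")                 // phoenix-spark upserts rows; overwrite is the expected mode
  .option("table", "OUTPUT_TABLE")   // placeholder target table
  .option("zkUrl", "localhost:2181")
  .save()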

WriteConf of Spark-Cassandra Connector being used or not

I am using Spark version 1.6.2, Spark-Cassandra Connector 1.6.0, Cassandra-Driver-Core 3.0.3
I am writing a simple Spark job in which I am trying to insert some rows to a table in Cassandra. The code snippet used was:
val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "<Cassandra IP>")
  .set("spark.cassandra.auth.username", "test")
  .set("spark.cassandra.auth.password", "test")
  .set("spark.cassandra.output.batch.size.rows", "1")

val sc = new SparkContext(sparkConf)

val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")

val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd

val addRowList = ListBuffer(
  Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
  Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
)

val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
Now, I have passed the WriteConf parameter output.batch.size.rows when creating the SparkConf object. I expect this code to write one row per batch to Cassandra, but I cannot find a way to verify that the batch configuration actually used for writing to Cassandra is the one passed in the code snippet rather than the default.
I could not find anything in Cassandra's cassandra.log, system.log, or debug.log.
So can anyone help me with a way to verify the WriteConf being used by the Spark-Cassandra Connector when writing batches to Cassandra?
There are two things you can do to verify that your setting was applied.
First, you can call the method which creates the WriteConf:
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want; it is the default argument to saveToCassandra.
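A small sketch of that check, assuming the 1.6.x connector layout where WriteConf lives in com.datastax.spark.connector.writer and is a plain case class:
import com.datastax.spark.connector.writer.WriteConf

// build the same WriteConf that saveToCassandra derives from the SparkConf by default
val writeConf = WriteConf.fromSparkConf(sparkConf)

// WriteConf is a case class, so printing it lists every effective setting;
// with the configuration above, the batch size should show up as one row per batch
println(writeConf)
println(writeConf.batchSize)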
Second, you can explicitly pass a WriteConf to the saveToCassandra method:
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))
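Applied to the snippet from the question, that could look roughly like this, assuming RowsInBatch is the batch-size type provided by the connector:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}

// explicitly request one row per batch rather than relying on the SparkConf setting
insertRowRDD.saveToCassandra("test", "test",
  writeConf = WriteConf(batchSize = RowsInBatch(1)))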

Spark Cassandra Connector: SQLContext.read + SQLContext.write vs. manual parsing and inserting (JSON -> Cassandra)

Good morning,
I just started investigating Apache Spark and Apache Cassandra. The first step is a really simple use case: taking a file containing e.g. customer + score.
The Cassandra table has customer as the primary key. Cassandra is just running locally (so no cluster at all!).
So the Spark job (standalone, local[2]) parses the JSON file and then writes everything into Cassandra.
The first solution was:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val cass = CassandraConnector(conf)

val customerScores = sc.textFile(file).cache()

val customerScoreRDD = customerScores.mapPartitions(lines => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  lines
    .map(line => mapper.readValue(line, classOf[CustomerScore]))
    // Filter corrupt ones: empty values
    .filter(customerScore => customerScore.customer != null && customerScore.score != null)
})

customerScoreRDD.foreachPartition(rows => cass.withSessionDo(session => {
  val statement: PreparedStatement =
    session.prepare("INSERT INTO playground.customer_score (customer, score) VALUES (:customer, :score)")
  rows.foreach(row => {
    session.executeAsync(statement.bind(row.customer.asInstanceOf[Object], row.score))
  })
}))

sc.stop()
This means doing everything manually: parsing the lines and then inserting into Cassandra.
It takes roughly 714020 ms in total for 10000000 records (incl. creating the SparkContext and so on ...).
Then I read about the spark-cassandra-connector and did the following:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sql = new SQLContext(sc)

val customerScores = sql.read.json(file)

val customerScoresCorrected = customerScores
  // Filter corrupt ones: empty values
  .filter("customer is not null and score is not null")
  // Filter corrupt ones: invalid properties
  .select("customer", "score")

customerScoresCorrected.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "playground", "table" -> "customer_score"))
  .save()

sc.stop()
Much simpler in terms of the code needed, and it uses the given API.
This solution takes roughly 1232871 ms for 10000000 records (again end to end, so the same measurement points).
(I had a third solution as well, parsing manually plus using saveToCassandra, which takes 1530877 ms.)
Now my question:
Which is the "correct" way to fulfil this use case, i.e. which one is "best practice" (and, in a real scenario with clustered Cassandra and Spark, the best-performing one) nowadays?
Because from my results I would use the "manual" approach instead of SQLContext.read + SQLContext.write.
Thanks for your comments and hints in advance.
After playing around for quite a long time, the following has to be considered:
Of course, the amount of data
The type of your data: especially the variety of partition keys (all different vs. lots of duplicates)
The environment: Spark executors, Cassandra nodes, replication ...
For my use case, playing around with
def initSparkContext: SparkContext = {
  val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    // since we have nearly totally different PartitionKeys, default: 1000
    .set("spark.cassandra.output.batch.grouping.buffer.size", "1")
    // write as much concurrently, default: 5
    .set("spark.cassandra.output.concurrent.writes", "1024")
    // batch same replica, default: partition
    .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  val sc = new SparkContext(conf)
  sc
}
did boost the speed dramatically in my local run.
So you really need to try out the various parameters to find the best configuration for YOUR setup. At least that is the conclusion I reached.
