WriteConf of Spark-Cassandra Connector being used or not

I am using Spark 1.6.2, Spark-Cassandra Connector 1.6.0, and Cassandra-Driver-Core 3.0.3.
I am writing a simple Spark job in which I am trying to insert some rows into a table in Cassandra. The code snippet used was:
val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "<Cassandra IP>")
  .set("spark.cassandra.auth.username", "test")
  .set("spark.cassandra.auth.password", "test")
  .set("spark.cassandra.output.batch.size.rows", "1")
val sc = new SparkContext(sparkConf)
val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")
val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd
val addRowList = ListBuffer(
  Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
  Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
)
val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
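For context, a minimal sketch of what that case class could look like; the field names below are assumptions inferred from the four values in the sample rows, not the asker's actual schema:
// Hypothetical field names; the real ones must match the columns of the Cassandra table test.test
case class Test(id: Int, seq: Int, amount: Long, payload: String)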
Now, I have passed the WriteConf parameter output.batch.size.rows when building the SparkConf object, so I expect this code to write one row per batch to Cassandra. However, I cannot find any way to cross-verify that the connector is using the batch configuration passed in the code snippet rather than the default one.
I could not find anything in Cassandra's cassandra.log, system.log, or debug.log.
So can anyone suggest a way to cross-verify the WriteConf that the Spark-Cassandra Connector uses when writing batches to Cassandra?

There are two things you can do to verify that your setting was applied.
First, you can call the method which creates the WriteConf:
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want. This is the default argument to saveToCassandra.
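For illustration, a minimal sketch of that check, assuming connector 1.6.x where WriteConf exposes a batchSize field:
import com.datastax.spark.connector.writer.WriteConf

// Build the WriteConf exactly as saveToCassandra would by default
val writeConf = WriteConf.fromSparkConf(sparkConf)
// With output.batch.size.rows = 1 this should print RowsInBatch(1)
println(writeConf.batchSize)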
Second, you can explicitly pass a WriteConf to the save method (both saveToCassandra and saveAsCassandraTable accept one):
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))
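Applied to the RDD from the question, a hedged sketch that overrides only the batch size and leaves the other WriteConf fields at their defaults:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}

// The explicit writeConf takes precedence over the SparkConf settings
insertRowRDD.saveToCassandra("test", "test",
  writeConf = WriteConf(batchSize = RowsInBatch(1)))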

Related

Spark SQL querying a Hive table from workers

I am trying to query a Hive table from a map operation in Spark, but when it runs the query the execution freezes.
This is my test code:
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes")
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp")
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") // This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084")) // Some existing NPIs
val rows = NPIs.mapPartitions { partition =>
  val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
  partition.map { code =>
    val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '" + code + "'") // The program stops here
    res.first()
  }
}
rows.collect().foreach(println)
It loads the data from a CSV file, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution freezes.
It does not generate any error; it just keeps running without doing anything.
The Spark UI shows this situation
Actually, I am not sure whether I can query a table in a distributed way; I cannot find it in the documentation.
Any suggestion?
Thanks.

Spark dataframe returning only structure when connected to Phoenix query server

I am connecting to HBase (v1.2) via the Phoenix (4.11) Query Server from Spark 2.2.0, but the DataFrame returns only the table structure with empty rows, even though data is present in the table.
Here is the code I am using to connect to the Query Server.
// jar: phoenix-4.11.0-HBase-1.2-thin-client.jar
val prop = new java.util.Properties
prop.setProperty("driver", "org.apache.phoenix.queryserver.client.Driver")
val url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF"
val d1 = spark.sqlContext.read.jdbc(url,"TABLE1",prop)
d1.show()
Can anyone please help me solve this issue? Thanks in advance.
If you are using Spark 2.2, the better approach would be to load directly via Phoenix as a DataFrame. This way you only provide the ZooKeeper URL, and you can supply a predicate so that you load only the data required rather than the entire table.
import org.apache.phoenix.spark._
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

val configuration = new Configuration()
configuration.set("hbase.zookeeper.quorum", "localhost:2181")

val spark = SparkSession.builder().master("local").enableHiveSupport().getOrCreate()
val df = spark.sqlContext.phoenixTableAsDataFrame(
  "TABLE1",
  Seq("COL1", "COL2"),
  predicate = Some("\"COL1\" = 1"),
  conf = configuration)
See the Phoenix-Spark plugin documentation for more info on reading a table as an RDD and on saving DataFrames and RDDs.
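Going the other way, a hedged sketch of writing a DataFrame back through the Phoenix-Spark plugin; the output table name and ZooKeeper quorum below are placeholders:
import org.apache.phoenix.spark._
import org.apache.spark.sql.SaveMode

// OUTPUT_TABLE must already exist in Phoenix with matching columns
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "localhost:2181"))
  .save()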

Reading data from HBase using the Get command in Spark

I want to read data from an HBase table using the Get command, since I also have the row key. I want to do that in my Spark Streaming application. Is there any source code someone can share?
You can use Spark's newAPIHadoopRDD to read the HBase table, which returns an RDD.
For example:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().setAppName("Hbase").setMaster("local")
val sc = new SparkContext(sparkConf)

// Point the scan at the target table via the HBase/ZooKeeper settings
val conf = HBaseConfiguration.create()
val tableName = "table"
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)

// Returns an RDD[(ImmutableBytesWritable, Result)]
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + rdd.count())
sc.stop()
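If you need actual cell values rather than just a count, here is a hedged sketch of pulling one column out of each Result before calling sc.stop(); the column family cf and qualifier col1 are placeholders:
import org.apache.hadoop.hbase.util.Bytes

// Turn each (rowKey, Result) pair into a (rowKey, cellValue) string pair
val values = rdd.map { case (key, result) =>
  val rowKey = Bytes.toString(key.copyBytes())
  val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
  (rowKey, value)
}
values.take(10).foreach(println)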
Or you can use a Spark-HBase connector such as the HortonWorks HBase connector (SHC):
https://github.com/hortonworks-spark/shc
You can also use the Spark-Phoenix API:
https://phoenix.apache.org/phoenix_spark.html

How to use Cassandra Context in Spark 2.0

In previous versions of Spark, such as 1.6.1, I was creating a Cassandra context from the Spark context:
import org.apache.spark.{ Logging, SparkContext, SparkConf }

// config
val conf: org.apache.spark.SparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", CassandraHost)
  .setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: org.apache.spark.sql.cassandra.CassandraSQLContext = new CassandraSQLContext(sc)

// Query using the Cassandra context
cassandraSqlCtx.sql("select id from table ")
But in Spark 2.0, SparkContext is replaced by SparkSession, so how can I use the Cassandra context?
Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL (HQL). In Spark 2.0 this just means you will need to manually register Cassandra tables using createOrReplaceTempView until an ExternalCatalogue is implemented.
In SQL this looks like:
spark.sql("""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test")""".stripMargin)
In the raw DataFrame API it looks like:
spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "words"))
  .load
  .createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.
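Either way, the registered view then behaves like any other table from the SparkSession, for example:
// Runs against the Cassandra-backed view registered above
spark.sql("SELECT * FROM words").show()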

Spark Cassandra Connector: SQLContext.read + SQLContext.write vs. manual parsing and inserting (JSON -> Cassandra)

Good morning,
I just started investigating Apache Spark and Apache Cassandra. The first step is a really simple use case: taking a file containing e.g. customer + score.
The Cassandra table has customer as the primary key. Cassandra is just running locally (so no cluster at all!).
So the Spark job (standalone, local[2]) parses the JSON file and then writes everything into Cassandra.
The first solution was:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val cass = CassandraConnector(conf)
val customerScores = sc.textFile(file).cache()
val customerScoreRDD = customerScores.mapPartitions(lines => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  lines
    .map(line => {
      mapper.readValue(line, classOf[CustomerScore])
    })
    // Filter corrupt ones: empty values
    .filter(customerScore => customerScore.customer != null && customerScore.score != null)
})
customerScoreRDD.foreachPartition(rows => cass.withSessionDo(session => {
  val statement: PreparedStatement =
    session.prepare("INSERT INTO playground.customer_score (customer,score) VALUES (:customer,:score)")
  rows.foreach(row => {
    session.executeAsync(statement.bind(row.customer.asInstanceOf[Object], row.score))
  })
}))
sc.stop()
This means doing everything manually: parsing the lines and then inserting into Cassandra.
It takes roughly 714020 ms in total for 10000000 records (incl. creating the SparkContext and so on ...).
Then I read about the spark-cassandra-connector and did the following:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
var sql = new SQLContext(sc)
val customerScores = sql.read.json(file)
val customerScoresCorrected = customerScores
  // Filter corrupt ones: empty values
  .filter("customer is not null and score is not null")
  // Filter corrupt ones: invalid properties
  .select("customer", "score")
customerScoresCorrected.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "playground", "table" -> "customer_score"))
  .save()
sc.stop()
Much simpler in terms of the code needed, and it uses the given API.
This solution takes roughly 1232871 ms for 10000000 records (again end to end, so the same measuring points).
(I had a third solution as well, parsing manually plus using saveToCassandra, which takes 1530877 ms.)
Now my question:
Which is the "correct" way to fulfil this use case, i.e. which one is the "best practice" (and, in a real scenario with clustered Cassandra and Spark, the best-performing one) nowadays?
Because from my results I would use the "manual" approach instead of SQLContext.read + SQLContext.write.
Thanks for your comments and hints in advance.
Actually, after playing around for a long time now, the following has to be considered:
Of course, the amount of data
The type of your data: especially the variety of partition keys (each one different vs. lots of duplicates)
The environment: Spark executors, Cassandra nodes, replication ...
For my use case, playing around with
def initSparkContext: SparkContext = {
  val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    // since we have nearly totally different partition keys, default: 1000
    .set("spark.cassandra.output.batch.grouping.buffer.size", "1")
    // write as much as possible concurrently, default: 5
    .set("spark.cassandra.output.concurrent.writes", "1024")
    // batch by the same replica, default: partition
    .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  val sc = new SparkContext(conf)
  sc
}
did boost speed dramatically in my local run.
So you very much need to try out the various parameters to find YOUR best way. At least that is the conclusion I reached.
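As a hedged usage sketch (the keyspace, table, and sample data below are placeholders matching the earlier example), the tuned context is then used like any other, and saveToCassandra picks the output.* settings up from the SparkConf automatically:
import com.datastax.spark.connector._

val sc = initSparkContext
// Tuple RDDs map onto the (customer, score) columns via SomeColumns
val scores = sc.parallelize(Seq(("customerA", 1L), ("customerB", 2L)))
scores.saveToCassandra("playground", "customer_score", SomeColumns("customer", "score"))
sc.stop()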
