How to use the Spark Cassandra Connector API in Scala - apache-spark

My previous post: Reparing Prepared stmt warning.
I was not able to solve it; following a few suggestions, I tried using the Spark Cassandra connector to solve my problem.
But I am completely confused about its usage in my application.
I tried to write the code below, but I am not sure how exactly to use the APIs.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "1.1.1.1")
  .set("spark.cassandra.auth.username", "auser")
  .set("spark.cassandra.auth.password", "apass")
  .set("spark.cassandra.connection.port", "9042")

val sc = new SparkContext(conf)
val c = CassandraConnector(sc.getConf)
c.withSessionDo { session =>
  val insertStatement = prepareStatement(session, insertQuery)
  val boundStatement = new BoundStatement(insertStatement)
  batch.add(boundStatement.bind(data.service_id, data.asset_id, data.summ_typ, data.summ_dt, data.trp_summ_id, data.asset_serial_no, data.avg_sp, data.c_dist, data.c_epa, data.c_gal, data.c_mil, data.device_id, data.device_serial_no, data.dist, data.en_dt, data.en_lat, data.en_long, data.epa, data.gal, data.h_dist, data.h_epa, data.h_gal, data.h_mil, data.id_tm, data.max_sp, data.mil, data.rec_crt_dt, data.st_lat, data.st_long, data.tr_dis, data.tr_dt, data.tr_dur, data.st_addr, data.en_addr))
}
def prepareStatement(session: Session, query: String): PreparedStatement = {
  val cluster = session.clustername
  get(cluster, query.toString) match {
    case Some(stmt) => stmt
    case None =>
      synchronized {
        get(cluster, query.toString) match {
          case Some(stmt) => stmt
          case None =>
            val stmt = session.prepare(query)
            put(cluster, query.toString, stmt)
        }
      }
  }
}
OR
val table1 = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("spark.cassandra.auth.username", "apoch_user")
  .option("spark.cassandra.auth.password", "Apoch#123")
  .options(Map(
    "table" -> "trip_summary_data",
    "keyspace" -> "aphoc",
    "cluster" -> "Cluster1"
  ))
  .load()

def insert(data: TripHistoryData) {
  table1.createOrReplaceTempView("inputTable1")
  val df1 = spark.sql("select * from inputTable1 where service_id = ? and asset_id = ? and summ_typ = ? and summ_dt >= ? and summ_dt <= ?")
  val df2 = spark.sql("insert into inputTable1 values (data.service_id, data.asset_id, data.summ_typ, data.summ_dt, data.trp_summ_id, data.asset_serial_no, data.avg_sp, data.c_dist, data.c_epa, data.c_gal, data.c_mil, data.device_id, data.device_serial_no, data.dist, data.en_dt, data.en_lat, data.en_long, data.epa, data.gal, data.h_dist, data.h_epa, data.h_gal, data.h_mil, data.id_tm, data.max_sp, data.mil, data.rec_crt_dt, data.st_lat, data.st_long, data.tr_dis, data.tr_dt, data.tr_dur, data.st_addr, data.en_addr))
}

You need to concentrate on how you process your data in your Spark application, not on how the data is read or written (that matters, of course, but only when you hit performance problems).
If you're using Spark, then you need to think in Spark terms, processing data as RDDs or DataFrames. In this case you need to use constructs like these (with DataFrames):
import org.apache.spark.sql.cassandra._ // provides cassandraFormat on readers and writers

val df = spark
  .read
  .cassandraFormat("words", "test")
  .load()

val newDf = df.filter(...) // some operation on the source data

newDf.write
  .cassandraFormat("words_copy", "test")
  .save()
And avoid the direct use of session.prepare/session.execute, cluster.connect, etc. - the Spark connector will do the prepare and other optimizations under the hood.
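For the original question, a minimal sketch of what the insert path could look like with the DataFrame API instead of manual prepared statements (assuming a DataFrame named tripDf whose columns match the trip_summary_data table from the question; tripDf is hypothetical and not part of the original code):

import org.apache.spark.sql.cassandra._ // enables cassandraFormat on readers/writers

// Append rows to aphoc.trip_summary_data; the connector prepares and batches statements internally
tripDf.write
  .cassandraFormat("trip_summary_data", "aphoc")
  .mode("append")
  .save()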

Related

How to write two streaming DataFrames into two different tables in MySQL in Spark Structured Streaming?

I am using Spark version 2.3.2.
I have written code in Spark Structured Streaming to insert streaming DataFrame data into two different MySQL tables.
Let's say there are two streaming DataFrames: DF1 and DF2.
I have written two queries (query1, query2) using the foreachWriter API to write to the MySQL tables from the respective streams, i.e. DF1 into MySQL table A and DF2 into MySQL table B.
When I run the Spark job, it first runs query1 and then query2, so it writes to table A but not to table B.
If I change my code to run query2 first and then query1, it writes to table B but not to table A.
So I understand that only the first query is executed and writes to its table.
Note: I have tried giving different MySQL users/databases to the two tables. No luck.
Can anyone please advise how to make this work?
My code is below:
import java.sql._
import org.apache.spark.sql.ForeachWriter

class JDBCSink1(url: String, user: String, pwd: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    val insertSql = """ INSERT INTO tableA(col1,col2,col3) VALUES(?,?,?); """
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.setString(3, value(2).toString)
    preparedStmt.execute
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}
class JDBCSink2(url: String, user: String, pwd: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    val insertSql = """ INSERT INTO tableB(col1,col2) VALUES(?,?); """
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.execute
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}
val url1 = "jdbc:mysql://hostname:3306/db1"
val url2 = "jdbc:mysql://hostname:3306/db2"
val user1 = "usr1"
val user2 = "usr2"
val pwd = "password"

val Writer1 = new JDBCSink1(url1, user1, pwd)
val Writer2 = new JDBCSink2(url2, user2, pwd)

val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start().awaitTermination()

val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start().awaitTermination()
You are blocking the second query because of the awaitTermination. If you want to have two output streams, you need to start both before waiting for their termination:
val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start()

val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start()

query1.awaitTermination()
query2.awaitTermination()
Edit:
Spark also allows you to schedule and allocate resources to the different streaming queries as described in Scheduling within an application. You can configure your pools based on:
schedulingMode: can be FIFO or FAIR
weight: "This controls the pool’s share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get 2x more resources as other active pools."
minShare: "Apart from an overall weight, each pool can be given a minimum shares (as a number of CPU cores) that the administrator would like it to have."
The pool configurations can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either putting a file named fairscheduler.xml on the classpath, or setting spark.scheduler.allocation.file property in your SparkConf.
conf.set("spark.scheduler.allocation.file", "/path/to/file")
Applying a different pool can be done as shown below. Note that start() does not take a pool name; you set the spark.scheduler.pool local property on the thread before starting each query:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
val query1 = streamDF1.writeStream.[...].start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
val query2 = streamDF2.writeStream.[...].start()

Spark write to postgres slow

I'm writing data (approx. 83M records) from a DataFrame into PostgreSQL and it's kind of slow. It takes 2.7 hours to complete the write to the database.
Looking at the executors, there is only one active task running on just one executor. Is there any way I could parallelize the writes to the database using all executors in Spark?
...
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
salesReportsDf.write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
Thanks
So I figured out the problem. Basically, repartitioning my DataFrame increased the database write throughput by 100%.
def srcTable(config: Config): Map[String, String] = {
  val SERVER        = config.getString("db_host")
  val PORT          = config.getInt("db_port")
  val DATABASE      = config.getString("database")
  val USER          = config.getString("db_user")
  val PASSWORD      = config.getString("db_password")
  val TABLE         = config.getString("table")
  val PARTITION_COL = config.getString("partition_column")
  val LOWER_BOUND   = config.getString("lowerBound")
  val UPPER_BOUND   = config.getString("upperBound")
  val NUM_PARTITION = config.getString("numPartitions")

  Map(
    "url" -> s"jdbc:postgresql://$SERVER:$PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> TABLE,
    "user" -> USER,
    "password" -> PASSWORD,
    "partitionColumn" -> PARTITION_COL,
    "lowerBound" -> LOWER_BOUND,
    "upperBound" -> UPPER_BOUND,
    "numPartitions" -> NUM_PARTITION
  )
}
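A short sketch of how this option map might be used on the read side (assuming a SparkSession named spark and a loaded config object; this usage is not shown in the original answer):

// Reads the table in parallel: numPartitions tasks, each scanning a range of partitionColumn
val salesReportsDf = spark.read
  .format("jdbc")
  .options(srcTable(config))
  .load()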
Spark also has an option called "batchsize" for JDBC writes. The default value is pretty low (1000).
connectionProperties.put("batchsize", "100000")
Setting it to a much higher value should speed up writes to external databases.
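Putting repartitioning and batchsize together, a hedged sketch of the write path; the partition count and batch size below are illustrative values to tune, not figures from the original answer:

salesReportsDf
  .repartition(16)               // spread the write across multiple tasks/executors
  .write
  .mode(SaveMode.Append)
  .option("batchsize", "100000") // rows per JDBC batch insert
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)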

Regarding Spark Dataframereader jdbc

I have a question regarding the mechanics of the Spark DataFrameReader. I would appreciate it if anybody could help me. Let me explain the scenario here.
I am creating a DataFrame from a DStream like this. This is the input data:
var config = new HashMap[String, String]()
config += ("zookeeper.connect" -> zookeeper)
config += ("partition.assignment.strategy" -> "roundrobin")
config += ("bootstrap.servers" -> broker)
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder")
config += ("group.id" -> "default")

val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, config.toMap, Set(topic)).map(_._2)

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
    val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
    val rddDF = sqlContext.read.json(rddJson)
    rddDF.registerTempTable("inputData")
    val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD", "A", numOfPartiton, lowerBound, upperBound)
Here is the code of ReadDataFrameHelper
def readDataFrameHelperFromDB(sqlContext: HiveContext, jdbcUrl: String, dbTableOrQuery: String,
                              columnToPartition: String, numOfPartiton: Int, lowerBound: Int, highBound: Int): DataFrame = {
  val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
    columnName = columnToPartition,
    lowerBound = lowerBound,
    upperBound = highBound,
    numPartitions = numOfPartiton,
    connectionProperties = new java.util.Properties()
  )
  jdbcDF
}
Lastly, I am doing a join like this:
val joinedData = rddDF.join(dbDF, rddDF("ID") === dbDF("ID")
    && rddDF("CODE") === dbDF("CODE"), "left_outer")
  .drop(dbDF("code"))
  .drop(dbDF("id"))
  .drop(dbDF("number"))
  .drop(dbDF("key"))
  .drop(dbDF("loaddate"))
  .drop(dbDF("fid"))

joinedData.show()
My input DStream will have 1000 rows, and the database table contains millions of rows. So when I do this join, will Spark load all the rows from the database table, or will it read only the specific rows from the DB that match the code and id values from the input DStream?
As specified by zero323, I have also confirmed that the data is read in full from the table. I checked the DB session logs and saw that the whole dataset is getting loaded.
Thanks zero323
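If loading the full table is too expensive, one common workaround (not part of the original answer; the filter value here is purely illustrative) is to pass a subquery instead of the bare table name, so the database restricts the rows before Spark reads them:

// Read only the rows matching a predicate by aliasing a subquery as the table
val filteredTable = "(select * from ABCD where CODE = 'A') as ABCD_filtered"
val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(
  sqlContext, jdbcUrl, filteredTable, "A", numOfPartiton, lowerBound, upperBound)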

How to combine two RDDs with a function to get a result RDD

I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()

  //TODO
  //val result = st.join(ic).mapValues( )

  sc.stop()
}
Here is what I want to do:
List[ST] -> map -> Map(Key, st) -> groupByKey -> Map(Key, List[st])
List[IC] -> map -> Map(Key, ic) -> groupByKey -> Map(Key, List[ic])
STRDD join ICRDD to get Map(Key, (List[st], List[ic]))
I have a function that compares listST and listIC and returns a List[result]; the result contains information from both SubStTime and IcTemp:
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapValues or some other way to get my result.
Thanks
Since groupByKey produces Iterable values, the joined value is a pair of Iterables; convert them to List before calling your function:
val result = st.join(ic).mapValues(x => calcIcSt(x._1.toList, x._2.toList))

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
use a "union" operation to append to the lookup RDD and persist it after each union.
save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))

// checkpointing is mandatory
ssc.checkpoint("_checkpoints")

val rdd = sc.parallelize(0 to 9).map(n => (n, n % 2 toString))

import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)

import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)

val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
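To actually run the example, the streaming context still has to be started; these two standard calls are not in the original snippet:

ssc.start()            // begin processing the ConstantInputDStream
ssc.awaitTermination() // block until the streaming job is stopped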
