Spark write to postgres slow - apache-spark

I'm writing data (approx. 83M records) from a DataFrame into PostgreSQL, and it's quite slow: it takes 2.7 hours to finish writing to the database.
Looking at the executors, there is only one active task running on a single executor. Is there any way I could parallelize the writes to the database across all executors in Spark?
...
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
salesReportsDf.write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)
Thanks

So I figured out the problem. Basically, repartitioning my DataFrame increased the database write throughput by 100%. These are the JDBC options I build for the partitioned source read:
def srcTable(config: Config): Map[String, String] = {
  val SERVER        = config.getString("db_host")
  val PORT          = config.getInt("db_port")
  val DATABASE      = config.getString("database")
  val USER          = config.getString("db_user")
  val PASSWORD      = config.getString("db_password")
  val TABLE         = config.getString("table")
  val PARTITION_COL = config.getString("partition_column")
  val LOWER_BOUND   = config.getString("lowerBound")
  val UPPER_BOUND   = config.getString("upperBound")
  val NUM_PARTITION = config.getString("numPartitions")

  Map(
    "url"             -> s"jdbc:postgresql://$SERVER:$PORT/$DATABASE",
    "driver"          -> "org.postgresql.Driver",
    "dbtable"         -> TABLE,
    "user"            -> USER,
    "password"        -> PASSWORD,
    "partitionColumn" -> PARTITION_COL,
    "lowerBound"      -> LOWER_BOUND,
    "upperBound"      -> UPPER_BOUND,
    "numPartitions"   -> NUM_PARTITION
  )
}
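For context, a map like this is what Spark's JDBC reader consumes. A minimal sketch of how it would be used (assuming a SparkSession named spark and the config object are in scope):

// Sketch: a partitioned JDBC read using the options above.
// The resulting DataFrame has NUM_PARTITION partitions, so a subsequent
// .write.jdbc(...) runs one task per partition instead of a single task.
val reportsDf = spark.read
  .format("jdbc")
  .options(srcTable(config))
  .load()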

Spark also has an option called "batchsize" for writes over JDBC. The default value is pretty low (1000).
connectionProperties.put("batchsize", "100000")
Setting it to a much higher value should speed up writes to external databases.
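Putting the two together, a hedged sketch of the write path from the original question (salesReportsDf and the connection constants come from the question; the partition count of 16 is purely illustrative):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
prop.setProperty("batchsize", "100000") // JDBC batch size, default is 1000

salesReportsDf
  .repartition(16) // illustrative: 16 concurrent JDBC writer tasks
  .write
  .mode(SaveMode.Append)
  .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", REPORTS_TABLE, prop)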

Related

How to Use spark cassandra connector API in scala

My previous post: Reparing Prepared stmt warning.
I was not able to solve it there; following a few suggestions, I tried using the Spark Cassandra connector to solve my problem.
But I am completely confused about its usage in my application.
I tried to write code as below, but I'm not sure how exactly to use the APIs.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "1.1.1.1")
  .set("spark.cassandra.auth.username", "auser")
  .set("spark.cassandra.auth.password", "apass")
  .set("spark.cassandra.connection.port", "9042")

val sc = new SparkContext(conf)
val c = CassandraConnector(sc.getConf)

c.withSessionDo ( session => session.prepareStatement(session, insertQuery)
  val boundStatement = new BoundStatement(insertStatement)
  batch.add(boundStatement.bind(data.service_id, data.asset_id, data.summ_typ, data.summ_dt, data.trp_summ_id, data.asset_serial_no, data.avg_sp, data.c_dist, data.c_epa, data.c_gal, data.c_mil, data.device_id, data.device_serial_no, data.dist, data.en_dt, data.en_lat, data.en_long, data.epa, data.gal, data.h_dist, data.h_epa, data.h_gal, data.h_mil, data.id_tm, data.max_sp, data.mil, data.rec_crt_dt, data.st_lat, data.st_long, data.tr_dis, data.tr_dt, data.tr_dur, data.st_addr, data.en_addr))
)

def prepareStatement(session: Session, query: String): PreparedStatement = {
  val cluster = session.clustername
  get(cluster, query.toString) match {
    case Some(stmt) => stmt
    case None =>
      synchronized {
        get(cluster, query.toString) match {
          case Some(stmt) => stmt
          case None =>
            val stmt = session.prepare(query)
            put(cluster, query.toString, stmt)
        }
      }
  }
}
-----------------------------------------------------------------------------------------OR
val table1 = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("spark.cassandra.auth.username", "apoch_user")
  .option("spark.cassandra.auth.password", "Apoch#123")
  .options(Map(
    "table"    -> "trip_summary_data",
    "keyspace" -> "aphoc",
    "cluster"  -> "Cluster1"
  ))
  .load()

def insert(data: TripHistoryData) {
  table1.createOrReplaceTempView("inputTable1")
  val df1 = spark.sql("select * from inputTable1 where service_id = ? and asset_id = ? and summ_typ = ? and summ_dt >= ? and summ_dt <= ?")
  val df2 = spark.sql("insert into inputTable1 values (data.service_id, data.asset_id, data.summ_typ, data.summ_dt, data.trp_summ_id, data.asset_serial_no, data.avg_sp, data.c_dist, data.c_epa, data.c_gal, data.c_mil, data.device_id, data.device_serial_no, data.dist, data.en_dt, data.en_lat, data.en_long, data.epa, data.gal, data.h_dist, data.h_epa, data.h_gal, data.h_mil, data.id_tm, data.max_sp, data.mil, data.rec_crt_dt, data.st_lat, data.st_long, data.tr_dis, data.tr_dt, data.tr_dur, data.st_addr, data.en_addr))
}
You need to concentrate on how you process your data in your Spark application, not on how the data is read or written (that matters, of course, but only when you hit performance problems).
If you're using Spark, then you need to think in Spark terms, processing your data as RDDs or DataFrames. In this case you need to use constructs like these (with DataFrames):
import org.apache.spark.sql.cassandra._ // provides cassandraFormat on the reader/writer

val df = spark
  .read
  .cassandraFormat("words", "test")
  .load()

val newDf = df.filter(...) // some operation on the source data

newDf.write
  .cassandraFormat("words_copy", "test")
  .save()
And avoid direct use of session.prepare/session.execute, cluster.connect, etc. - the Spark connector will do the prepare and other optimizations under the hood.
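If the data is being handled as an RDD rather than a DataFrame, the connector offers the same kind of managed write there too. A short sketch (the keyspace, table, and column names are illustrative, and sc is assumed to be a SparkContext configured as in the question):

import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

// Illustrative: an RDD of (word, count) pairs written through the connector,
// which handles prepared statements and batching internally.
val wordCounts = sc.parallelize(Seq(("spark", 3), ("cassandra", 2)))
wordCounts.saveToCassandra("test", "words", SomeColumns("word", "count"))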

Persistent in-memory database in Apache Spark

I have a custom foreach writer for Spark Streaming. For each row I write to a JDBC source. I also want to do some kind of fast lookup before I perform the JDBC operation and update the value after I perform it, like "Step-1" and "Step-3" in the sample code below ...
I don't want to use external databases like Redis or MongoDB. I want something with a low footprint like RocksDB, Derby, etc. ...
I'm okay with storing one file per application, just like checkpointing; I'll create an internal-db folder ...
I could not find any in-memory DB for Spark ...
def main(args: Array[String]): Unit = {
  val brokers = "quickstart:9092"
  val topic = "safe_message_landing_app_4"

  val sparkSession = SparkSession.builder().master("local[*]").appName("Ganesh-Kafka-JDBC-Streaming").getOrCreate()
  val sparkContext = sparkSession.sparkContext
  sparkContext.setLogLevel("ERROR")
  val sqlContext = sparkSession.sqlContext

  val kafkaDataframe = sparkSession.readStream.format("kafka")
    .options(Map("kafka.bootstrap.servers" -> brokers, "subscribe" -> topic,
      "startingOffsets" -> "latest", "group.id" -> " Jai Ganesh", "checkpoint" -> "cp/kafka_reader"))
    .load()

  kafkaDataframe.printSchema()
  kafkaDataframe.createOrReplaceTempView("kafka_view")
  val sqlDataframe = sqlContext.sql("select concat ( topic, '-' , partition, '-' , offset) as KEY, string(value) as VALUE from kafka_view")

  val customForEachWriter = new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long) = {
      println("Open Started ==> partitionId ==> " + partitionId + " ==> version ==> " + version)
      true
    }

    override def process(value: Row) = {
      // Step 1 ==> Lookup a key in persistent KEY-VALUE store
      // JDBC operations
      // Step 3 ==> Update the value in persistent KEY-VALUE store
    }

    override def close(errorOrNull: Throwable) = {
      println(" ************** Closed ****************** ")
    }
  }

  val yy = sqlDataframe
    .writeStream
    .queryName("foreachquery")
    .foreach(customForEachWriter)
    .start()

  yy.awaitTermination()
  sparkSession.close()
}
Manjesh,
What you are looking for, "Spark and your in-memory DB as one seamless cluster, sharing a single process space", with support for MVCC, is exactly what SnappyData provides. With SnappyData, the tables you want to do a fast lookup on are in the same process that is running your Spark streaming job. Check it out here.
SnappyData has an Apache V2 license for the core product, and the specific use case you are referring to is available in the OSS download.
(Disclosure: I am a SnappyData employee, and it makes sense to provide a product-specific answer to this question because the product is the answer to the question.)

Spark Cassandra Connector: SQLContext.read + SQLContext.write vs. manual parsing and inserting (JSON -> Cassandra)

Good morning,
I just started investigating Apache Spark and Apache Cassandra. The first step is a really simple use case: taking a file containing e.g. customer + score.
The Cassandra table has customer as the primary key. Cassandra is just running locally (so no cluster at all!).
So the Spark job (standalone, local[2]) parses the JSON file and then writes the whole thing into Cassandra.
The first solution was:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val cass = CassandraConnector(conf)

val customerScores = sc.textFile(file).cache()

val customerScoreRDD = customerScores.mapPartitions(lines => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  lines
    .map(line => {
      mapper.readValue(line, classOf[CustomerScore])
    })
    // Filter corrupt ones: empty values
    .filter(customerScore => customerScore.customer != null && customerScore.score != null)
})

customerScoreRDD.foreachPartition(rows => cass.withSessionDo(session => {
  val statement: PreparedStatement = session.prepare("INSERT INTO playground.customer_score (customer,score) VALUES (:customer,:score)")
  rows.foreach(row => {
    session.executeAsync(statement.bind(row.customer.asInstanceOf[Object], row.score))
  })
}))

sc.stop()
This means doing everything manually: parsing the lines and then inserting into Cassandra.
It takes roughly 714020 ms in total for 10000000 records (incl. creating the SparkContext and so on ...).
Then I read about the spark-cassandra-connector and did the following:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sql = new SQLContext(sc)

val customerScores = sql.read.json(file)

val customerScoresCorrected = customerScores
  // Filter corrupt ones: empty values
  .filter("customer is not null and score is not null")
  // Filter corrupt ones: invalid properties
  .select("customer", "score")

customerScoresCorrected.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "playground", "table" -> "customer_score"))
  .save()

sc.stop()
Much simpler in terms of the code needed, and it uses the provided API.
This solution roughly takes 1232871 ms for 10000000 records (again end to end, so the same measuring points).
(I had a third solution as well, parsing manually plus using saveToCassandra, which takes 1530877 ms.)
Now my question:
Which way is the "correct" way to fulfil this use case, i.e. which one is "best practice" (and, in a real scenario with clustered Cassandra and Spark, the most performant) nowadays?
Because from my results I would use the "manual" approach instead of SQLContext.read + SQLContext.write.
Thanks for your comments and hints in advance.
Actually, after playing around for quite some time now, the following has to be considered:
The amount of data, of course.
The type of your data: especially the variety of partition keys (each one different vs. lots of duplicates).
The environment: Spark executors, Cassandra nodes, replication ...
For my use case, playing around with
def initSparkContext: SparkContext = {
  val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    // since we have nearly totally different PartitionKeys, default: 1000
    .set("spark.cassandra.output.batch.grouping.buffer.size", "1")
    // write as much concurrently, default: 5
    .set("spark.cassandra.output.concurrent.writes", "1024")
    // batch same replica, default: partition
    .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  val sc = new SparkContext(conf)
  sc
}
did boost the speed dramatically in my local run.
So you really need to try out the various parameters to find YOUR best setup. At least that is the conclusion I came to.

Regarding Spark Dataframereader jdbc

I have a question regarding the mechanics of Spark's DataFrameReader. I would appreciate it if anybody can help me. Let me explain the scenario here.
I am creating a DataFrame from a DStream like this. This is the input data:
var config = new HashMap[String, String]()
config += ("zookeeper.connect" -> zookeeper)
config += ("partition.assignment.strategy" -> "roundrobin")
config += ("bootstrap.servers" -> broker)
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder")
config += ("group.id" -> "default")

val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, config.toMap, Set(topic)).map(_._2)

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
    val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
    val rddDF = sqlContext.read.json(rddJson)
    rddDF.registerTempTable("inputData")
    val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD", "A", numOfPartiton, lowerBound, upperBound)
Here is the code of ReadDataFrameHelper:
def readDataFrameHelperFromDB(sqlContext: HiveContext, jdbcUrl: String, dbTableOrQuery: String,
                              columnToPartition: String, numOfPartiton: Int, lowerBound: Int, highBound: Int): DataFrame = {
  val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
    columnName = columnToPartition,
    lowerBound = lowerBound,
    upperBound = highBound,
    numPartitions = numOfPartiton,
    connectionProperties = new java.util.Properties()
  )
  jdbcDF
}
Lastly, I am doing a join like this:
val joinedData = rddDF.join(dbDF, rddDF("ID") === dbDF("ID")
    && rddDF("CODE") === dbDF("CODE"), "left_outer")
  .drop(dbDF("code"))
  .drop(dbDF("id"))
  .drop(dbDF("number"))
  .drop(dbDF("key"))
  .drop(dbDF("loaddate"))
  .drop(dbDF("fid"))

joinedData.show()
My input DStream will have 1000 rows, while the database table contains millions of rows. So when I do this join, will Spark load all the rows from the database, or will it read only those specific rows from the DB that match the code and id values in the input DStream?
As specified by zero323, I have also confirmed that the data is read in full from the table. I checked the DB session logs and saw that the whole dataset is getting loaded.
Thanks zero323
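For completeness, one way to avoid pulling the whole table is to push a subquery down as the JDBC "table", so the database only returns matching rows. This is a hedged sketch, not part of the original answer: it assumes CODE is a string column and that each micro-batch is small enough to collect its keys to the driver.

// Sketch: build an IN-list from this micro-batch and let the database filter,
// instead of reading all of ABCD and joining afterwards.
val codes = rddDF.select("CODE").distinct().collect().map(_.getString(0))
val inList = codes.map(c => s"'$c'").mkString(",")
val pushedDownQuery = s"(select * from ABCD where CODE in ($inList)) as t"

val dbDF = sqlContext.read.jdbc(jdbcUrl, pushedDownQuery, new java.util.Properties())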

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
1. use the "union" operation to append to the lookup RDD and persist it after each union.
2. save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))

// checkpointing is mandatory
ssc.checkpoint("_checkpoints")

val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))

import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)

import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)

val mappedStatefulStream = sessions.mapWithState(spec)

mappedStatefulStream.print()
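The snippet above only wires up the pipeline; to actually see batches being processed, the streaming context still has to be started (a usage note, not part of the original answer):

ssc.start()            // begin processing the DStream
ssc.awaitTermination() // block until the stream is stopped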
