Handle database connection inside spark streaming - apache-spark

I am not sure I correctly understand how Spark handles database connections, or how to reliably run a large number of database update operations inside Spark without potentially breaking the job. This is a code snippet I have been using (simplified for illustration):
val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = mongodb[BSONCollection](conf.getString("mongo.collections.dailyInventory"))

val stream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

def processRDD(rddElem: RDD[(String, String)]): Unit = {
  val df = rddElem.map(line => {
    ...
  }).flatMap(x => x).toDF()

  if (!isEmptyDF(df)) {
    var mongoF: Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]] = Seq()

    val dfF2 = df.groupBy($"CountryCode", $"Width", $"Height", $"RequestType", $"Timestamp").agg(sum($"Frequency")).collect().map(row => {
      val countryCode = row.getString(0); val width = row.getInt(1); val height = row.getInt(2)
      val requestType = row.getInt(3); val timestamp = row.getLong(4); val frequency = row.getLong(5)
      val endTimestamp = timestamp + 24 * 60 * 60 // next day

      val updateOp = dailyInventoryCol.updateModifier(BSONDocument("$inc" -> BSONDocument("totalFrequency" -> frequency)), false, true)

      val f: Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult] =
        dailyInventoryCol.findAndModify(BSONDocument("width" -> width, "height" -> height, "country_code" -> countryCode, "request_type" -> requestType,
          "startTs" -> timestamp, "endTs" -> endTimestamp), updateOp)
      f
    })

    mongoF = mongoF ++ dfF2

    // split into small chunks to avoid exhausting the mongodb connections
    val futureList: List[Seq[Future[dailyInventoryCol.BatchCommands.FindAndModifyCommand.FindAndModifyResult]]] = mongoF.grouped(200).toList

    futureList.foreach(seqF => {
      Await.result(Future.sequence(seqF), 40.seconds)
    })
  }
}

stream.foreachRDD(processRDD(_))
Basically, I am using ReactiveMongo (Scala); for each RDD, I convert it into a DataFrame, group/extract the necessary data, and then fire a large number of update queries against MongoDB. I want to ask:
I am using Mesos to deploy Spark on 3 servers, with one more server for the Mongo database. Is this the correct way to handle database connections? My concern is whether the connection/pool is opened at the beginning of the Spark job and maintained properly (surviving timeouts and network-error failover) over the whole lifetime of the job (weeks, months...), and whether it is closed when each batch finishes. Given that batches might be scheduled on different servers, does that mean each batch opens a different set of DB connections?
What happens if an exception occurs while executing the queries? Will the Spark job for that batch fail, while the next batch continues?
If there are too many update queries (2000+) to run against the Mongo database, and the execution time exceeds the configured Spark batch duration (2 minutes), will that cause a problem? I noticed that with my current setup, after about 2-3 days all batches are queued up as "Process" on the Spark Web UI and none of them exits properly (if I disable the Mongo update part, I can run for a week without problems). This basically hangs all batch jobs until I restart/resubmit the job.
Thanks a lot. I'd appreciate any help addressing these issues.

Please read the "Design Patterns for using foreachRDD" section in http://spark.apache.org/docs/latest/streaming-programming-guide.html. It will clear up your doubts about how connections should be created and used.
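For reference, the pattern that section recommends is roughly the following: obtain one connection per partition inside foreachPartition (ideally from a pool that lives on the executor), rather than one per record or one created on the driver. This is a minimal sketch assuming a hypothetical ConnectionPool helper, not your ReactiveMongo setup:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized pool living in each executor JVM
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    // return the connection to the pool so later batches scheduled on this executor reuse it
    ConnectionPool.returnConnection(connection)
  }
}

Because the pool lives in the executor JVM, connections are reused across batches on the same executor and you avoid trying to serialize a connection object from the driver.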
Secondly, I would suggest keeping the direct update operations separate from your Spark job. A better approach would be for your Spark job to process the data and post the results to a Kafka queue, and then have another dedicated process/job that reads from that Kafka queue and performs the insert/update operations on MongoDB.
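A sketch of that hand-off, assuming the plain kafka-clients producer, a placeholder broker list and topic name, and the DataFrame built in processRDD above (the dedicated Mongo writer would then consume this topic at its own pace):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

df.toJSON.foreachPartition { partition =>
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092") // assumption: your broker list
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  // hypothetical topic name; one record per processed row, serialized as JSON
  partition.foreach(json => producer.send(new ProducerRecord[String, String]("daily-inventory-updates", json)))
  producer.close()
}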

Related

Spark mapPartitions Issue

I am using Spark mapPartitions on my DataFrame, and the use case is that I should submit one job (either calling a Lambda or sending an SQS message) for each partition.
I am partitioning on a custom-formatted date column and logging the number of partitions before and after, and that works as expected.
However, when I look at the total number of jobs, it is more than the number of partitions; for some of the partitions there are two or three jobs!
Here is the code I am using:
val yearMonthQueryRDD = yearMonthQueryDF.rdd.mapPartitions(
  partition => {
    val partitionObjectList = new java.util.ArrayList[String]()
    logger.info("partitionIndex = {}", TaskContext.getPartitionId());
    val partitionCounter: AtomicLong = new AtomicLong(0)
    val partitionSize: AtomicLong = new AtomicLong(0)
    val paritionColumnName: AtomicReference[String] = new AtomicReference[String]();
    // Iterate the objects in a given partition
    val updatedPartition = partition.map( record => {
      import yearMonthQueryDF.sparkSession.implicits._
      partitionCounter.set(partitionCounter.get() + 1)
      val recordSizeInt = Integer.parseInt(record.getAs("object_size"))
      val recordSize: Long = recordSizeInt.toLong
      partitionObjectList.add(record.getAs("object_key"))
      paritionColumnName.set(record.getAs("partition_column_name"))
      record
    }
    ).toList
    logger_ref.info("No.of Elements in Partition ["+paritionColumnName.get()+"] are =["+partitionCounter.get()+"] Total Size=["+partitionSize.get()+"]")
    // Submit a job for the partition
    // jobUtil.submitJob(paritionColumnName.get(),partitionObjectList,partitionSize.get())
    updatedPartition.toIterator
  }
)
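As a side note on inspecting partitions (a debugging sketch only, not taken from the original post), the partition index and per-partition record count can also be logged without the AtomicLong/ArrayList bookkeeping by using mapPartitionsWithIndex:

// Log how many records land in each partition; purely for debugging.
val counted = yearMonthQueryDF.rdd.mapPartitionsWithIndex { (idx, partition) =>
  val records = partition.toList // materialize once so the count is accurate
  logger.info(s"partitionIndex=$idx count=${records.size}")
  records.iterator
}
println(s"numPartitions = ${counted.getNumPartitions}")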
Another thing that makes debugging harder is that the logging statements inside the mapPartitions() method are not found in the container error logs. Since they are executed on each worker node rather than on the master node, I expected to find them in the container logs rather than in the master node logs (I still need to figure out why I only see stderr logs but not stdout logs on the containers, though).
Thanks
Sateesh

How to use spark to write to HBase using multi-thread

I'm using Spark to write data to HBase, but at the writing stage only one executor and one core are doing any work.
I wonder why my code is not writing in parallel, and what I should do to make it write faster.
Here is my code:
val df = ss.sql("SQL")
HBaseTableWriterUtil.hbaseWrite(ss, tableList, df)

def hbaseWrite(ss: SparkSession, tableList: List[String], df: DataFrame): Unit = {
  val tableName = tableList(0)
  val rowKeyName = tableList(4)
  val rowKeyType = tableList(5)
  hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, s"${tableName}")
  // write to HBase
  val sc = ss.sparkContext
  sc.hadoopConfiguration.addResource(hbaseConf)
  val columns = df.columns
  val result = df.rdd.mapPartitions(par => {
    par.map(row => {
      var rowkey: String = ""
      if ("String".equals(rowKeyType)) {
        rowkey = row.getAs[String](rowKeyName)
      } else if ("Long".equals(rowKeyType)) {
        rowkey = row.getAs[Long](rowKeyName).toString
      }
      val put = new Put(Bytes.toBytes(rowkey))
      for (name <- columns) {
        var value = row.get(row.fieldIndex(name))
        if (value != null) {
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(name), Bytes.toBytes(value.toString))
        }
      }
      (new ImmutableBytesWritable, put)
    })
  })
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setOutputValueClass(classOf[Result])
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  result.saveAsNewAPIHadoopDataset(job.getConfiguration)
}
You cannot directly control how many parallel executors write to HBase.
However, you can start multiple Spark jobs from a multi-threaded client program.
For example, you can have a shell script that triggers multiple spark-submit commands to induce parallelism. Each Spark job can work on its own set of data, independently of the others, and push it into HBase.
This can also be done with the Spark Java/Scala SparkLauncher API, combined with the Java concurrency API (e.g. the Executor framework); a sketch of running several launches in parallel follows the example below.
val sparkLauncher = new SparkLauncher
// Set Spark properties. Only basic ones are shown here; they will be overridden if properties are set in the main class.
sparkLauncher.setSparkHome("/path/to/SPARK_HOME")
  .setAppResource("/path/to/jar/to/be/executed")
  .setMainClass("MainClassName")
  .setMaster("MasterType like yarn or local[*]")
  .setDeployMode("set deploy mode like cluster")
  .setConf("spark.executor.cores", "2")

// Launch the Spark application
val sparkLauncher1 = sparkLauncher.startApplication()

// Get the application id
val jobAppId = sparkLauncher1.getAppId

// Get the status of the launched job. This loop will continually print statuses like RUNNING, SUBMITTED etc.
while (true) {
  println(sparkLauncher1.getState().toString)
}
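To actually run several such launches concurrently, one option is to wrap each startApplication() call in a Scala Future and wait for the handles to reach a final state. A minimal sketch under those assumptions (the app resource, main class and data-set arguments are placeholders):

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Launch one Spark job per data set and block until each reaches a final state.
def launch(dataSet: String): Future[SparkAppHandle.State] = Future {
  val handle = new SparkLauncher()
    .setAppResource("/path/to/jar/to/be/executed") // placeholder
    .setMainClass("MainClassName")                 // placeholder
    .setMaster("yarn")
    .setDeployMode("cluster")
    .addAppArgs(dataSet)                           // which slice of data this job handles
    .startApplication()
  while (!handle.getState.isFinal) Thread.sleep(5000) // poll until FINISHED/FAILED/KILLED
  handle.getState
}

val states = Await.result(Future.sequence(Seq("set1", "set2", "set3").map(launch)), Duration.Inf)
states.foreach(println)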
However, the challenge is to track each of them for failure and automatic recovery. This can be tricky, especially when partial data has already been written into HBase, i.e. a job fails after processing only part of the data assigned to it. You may have to clean that data out of HBase automatically before retriggering the job.

SPARK JDBC Connection Reuse for many Queries Executed

Borrowing from SO 26634853, I have the following questions.
Using an Impala connection like this is a one-shot setup:
val JDBCDriver = "com.cloudera.impala.jdbc41.Driver"
val ConnectionURL = "jdbc:impala://url.server.net:21050/default;auth=noSasl"

Class.forName(JDBCDriver).newInstance
val con = DriverManager.getConnection(ConnectionURL)
val stmt = con.createStatement()
val rs = stmt.executeQuery(query)

val resultSetList = Iterator.continually((rs.next(), rs)).takeWhile(_._1).map(r => {
  getRowFromResultSet(r._2) // (ResultSet) => (spark.sql.Row)
}).toList

sc.parallelize(resultSetList)
What if I need to put a loop around con.createStatement() and the associated code beneath it, with some logic, and execute it, say, 5000 times?
Referring to the DB connection overhead discussions around map vs. mapPartitions: would I, in this case, incur 5000x the cost of the connection, or is it reusable the way it is done here? From the Scala JDBC documentation it looks like it can be reused.
My thinking is that since this is not a high-level Spark API like df_mysql = sqlContext.read.format("jdbc").options ..., the connection should remain open, but I would like to check. Maybe the Spark environment closes it automatically, but I think not. At the end of processing a close() could be issued?
Using the HiveContext means we do not need to open the connection every time, or is that not so? Using a Parquet or ORC table would, I presume, allow such an approach, since performance is quite fast.
I tried this simulation and the connection remains open, so provided it is not inside a foreach, it is not an issue performance-wise.
var counter = 0
do {
  counter = counter + 1
  val dataframe_mysql = spark.read.jdbc(jdbcUrl, "(select author from family) f ", connectionProperties)
  dataframe_mysql.show
} while (counter < 3)
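To answer the loop question concretely: a plain JDBC connection opened on the driver can be reused across many statements and closed once at the end. A minimal sketch under the same driver/URL assumptions as above (the collection of ~5000 query strings is a placeholder):

import java.sql.DriverManager

Class.forName("com.cloudera.impala.jdbc41.Driver").newInstance
val con = DriverManager.getConnection("jdbc:impala://url.server.net:21050/default;auth=noSasl")
try {
  // One connection, many statements: only the Statement/ResultSet are per-query.
  for (query <- queries) { // placeholder collection of query strings
    val stmt = con.createStatement()
    val rs = stmt.executeQuery(query)
    while (rs.next()) { /* consume row */ }
    rs.close()
    stmt.close()
  }
} finally {
  con.close() // issue the close once, at the end of processing
}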

Spark Streaming and Azure Event Hubs mapWithState

I've successfully integrated the code to pull messages off the Event Hub and process them through Spark/Spark Streaming. I'm now moving on to managing state as the messages pass through. This is the code I'm using, which for the most part is an adaptation of https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html
Essentially this works with a dummy source, and it works with a single stream on a single partition, but it doesn't work for the unioned, windowed stream. While I could create multiple instances of the stream, one for each partition, that rather defeats the point of the union and window, and my attempts to get it working that way failed. I'm stuck for inspiration on where to go now; if anyone has any ideas that would be grand.
val sparkSession = SparkSession.builder().master("local[2]").config(sparkConfiguration).getOrCreate()
val streamingContext = new StreamingContext(sparkSession.sparkContext, Seconds(10))
streamingContext.checkpoint(inputOptions.checkpointDir)

// derive the stream and window
val eventHubsStream = EventHubsUtils.createUnionStream(streamingContext, eventHubsParameters)
val eventHubsWindowedStream = eventHubsStream.window(Seconds(10))

val initialRDD = sparkSession.sparkContext.parallelize(List(("dummy", 100L), ("source", 32L)))
val stateSpec = StateSpec.function(trackStateFunc _)
  .initialState(initialRDD)
  .numPartitions(2)
  .timeout(Seconds(60))

val eventStream = eventHubsWindowedStream
  .map(messageStr => {
    // parse the event
    var event = gson.fromJson(new String(messageStr), classOf[Event])
    // return a tuple of key/value pair
    (event.product_id.toString, 1)
  })

val eventStateStream = eventStream.mapWithState(stateSpec)

val stateSnapshotStream = eventStateStream.stateSnapshots()
stateSnapshotStream.print()

stateSnapshotStream.foreachRDD { rdd =>
  import sparkSession.implicits._
  rdd.toDF("word", "count").registerTempTable("batch_word_count")
}

streamingContext.remember(Minutes(1))
streamingContext
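The trackStateFunc referenced above is not shown in the post; for context, a mapWithState mapping function for this key/value shape might look roughly like the following sketch (a word-count-style state function, not the author's actual code):

import org.apache.spark.streaming.{State, Time}

// Accumulate a running count per key and emit (key, newTotal) for each event.
def trackStateFunc(batchTime: Time, key: String, value: Option[Int], state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
  if (state.isTimingOut()) {
    None // key timed out; its state will be removed, nothing to emit
  } else {
    state.update(sum)
    Some((key, sum))
  }
}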
I resolved my issue by switching to the direct stream, and all my problems have gone away. I had avoided it because the progress directory only supports HDFS or ADL, which means I can no longer test locally.
EventHubsUtils.createDirectStreams(streamingContext, inputOptions.namespace,
inputOptions.hdfs, Map(inputOptions.eventhub -> GetEventHubParams(inputOptions)))
Still, the union stream doesn't work. Now I just need to figure out how to delete the progress directory in HDFS!
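For that last step, deleting a directory in HDFS can be done with the Hadoop FileSystem API; a small sketch, assuming the progress directory is the path passed as inputOptions.hdfs above:

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively delete the progress directory between test runs.
val fs = FileSystem.get(streamingContext.sparkContext.hadoopConfiguration)
fs.delete(new Path(inputOptions.hdfs), true) // true = recursive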

Spark Cassandra Connector: SQLContext.read + SQLContext.write vs. manual parsing and inserting (JSON -> Cassandra)

Good morning,
I just started investigating Apache Spark and Apache Cassandra. The first step is a really simple use case: taking a file containing e.g. customer + score.
The Cassandra table has customer as the primary key. Cassandra is just running locally (so no cluster at all!).
So the Spark job (standalone, local[2]) parses the JSON file and then writes the whole thing into Cassandra.
The first solution was:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val cass = CassandraConnector(conf)

val customerScores = sc.textFile(file).cache()

val customerScoreRDD = customerScores.mapPartitions(lines => {
  val mapper = new ObjectMapper with ScalaObjectMapper
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  mapper.registerModule(DefaultScalaModule)
  lines
    .map(line => {
      mapper.readValue(line, classOf[CustomerScore])
    })
    // Filter corrupt ones: empty values
    .filter(customerScore => customerScore.customer != null && customerScore.score != null)
})

customerScoreRDD.foreachPartition(rows => cass.withSessionDo(session => {
  val statement: PreparedStatement = session.prepare("INSERT INTO playground.customer_score (customer,score) VALUES (:customer,:score)")
  rows.foreach(row => {
    session.executeAsync(statement.bind(row.customer.asInstanceOf[Object], row.score))
  })
}))

sc.stop()
which means doing everything manually: parsing the lines and then inserting into Cassandra.
This takes roughly 714020 ms in total for 10000000 records (incl. creating the SparkContext and so on ...).
Then I read about the spark-cassandra-connector and did the following:
val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
val sc = new SparkContext(conf)
var sql = new SQLContext(sc)

val customerScores = sql.read.json(file)

val customerScoresCorrected = customerScores
  // Filter corrupt ones: empty values
  .filter("customer is not null and score is not null")
  // Filter corrupt ones: invalid properties
  .select("customer", "score")

customerScoresCorrected.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "playground", "table" -> "customer_score"))
  .save()

sc.stop()
Much simpler in terms of code needed, and it uses the given API.
This solution takes roughly 1232871 ms for 10000000 records (again end to end, so the same measuring points).
(I had a third solution as well, parsing manually plus using saveToCassandra, which takes 1530877 ms; a sketch of what that variant can look like is shown below.)
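For reference, that third variant with the connector's RDD API would look roughly like the following (a sketch only, since that code isn't shown here; it reuses customerScoreRDD from the first solution):

import com.datastax.spark.connector._

// Write the already-parsed RDD directly, letting the connector batch the inserts.
customerScoreRDD.saveToCassandra("playground", "customer_score", SomeColumns("customer", "score"))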
Now my question:
Which way is the "correct" way to fulfil this use case, i.e. which one is the "best practice" nowadays (and, in a real scenario with clustered Cassandra and Spark, the best-performing one)?
Because from my results, I would use the "manual" approach instead of SQLContext.read + SQLContext.write.
Thanks for your comments and hints in advance.
Actually, after playing around for a long time now, the following has to be considered:
Of course, the amount of data
The type of your data: especially the variety of partition keys (each one different vs. lots of duplicates)
The environment: Spark executors, Cassandra nodes, replication ...
For my use case, playing around with
def initSparkContext: SparkContext = {
  val conf = new SparkConf().setAppName("Application").setMaster("local[2]")
    // since we have nearly totally different partition keys, default: 1000
    .set("spark.cassandra.output.batch.grouping.buffer.size", "1")
    // write as much concurrently as possible, default: 5
    .set("spark.cassandra.output.concurrent.writes", "1024")
    // batch same replica, default: partition
    .set("spark.cassandra.output.batch.grouping.key", "replica_set")
  val sc = new SparkContext(conf)
  sc
}
boosted speed dramatically in my local run.
So there is very much a need to try out the various parameters to find YOUR best configuration. At least that is the conclusion I reached.
