Spark: How to make calls to a database using foreachPartition - apache-spark

We have a Spark Streaming job writing data to Amazon DynamoDB using foreachRDD, but it is very slow: our consumption rate is 10,000/sec, yet writing 10,000 records takes 35 min. This is the code piece (tempRequestsWithState is a DStream):
tempRequestsWithState.foreachRDD { rdd =>
  if ((rdd != null) && (rdd.count() > 0) && (!rdd.isEmpty())) {
    rdd.foreachPartition {
      case (topicsTableName, hashKeyTemp, attributeValueUpdate) => {
        val client = new AmazonDynamoDBClient()
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println("Error executing updateItem!\nTable ", se)
        }
      }
      case null =>
    }
  }
}
From research I learnt that using foreachPartition and creating a connection per partition will help, but I am not sure how to go about writing the code for it. I would greatly appreciate it if someone could help with this. Any other suggestion to speed up the writing is also greatly appreciated.

Each partition is handled by an executor (a JVM process), so inside foreachPartition you can write the code that initializes the database connection and writes to the database. In the code given, the first case(...) branch is where you write that code. As you get more partitions, if you have multiple executors writing to the database, the writes will happen in parallel.
rdd.foreachPartition { partition =>
  // initialize database connection
  // write to db
  // close connection
}
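Applied to the original DynamoDB snippet, a minimal sketch could look like the following (it reuses the tuple element names and SDK classes from the question and assumes the DStream elements really are those tuples); one client is created per partition and reused for every record in it:
tempRequestsWithState.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreachPartition { partition =>
      // one DynamoDB client per partition, reused for every record in it
      val client = new AmazonDynamoDBClient()
      partition.foreach { case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
        val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
        try client.updateItem(request)
        catch {
          case se: Exception => println(s"Error executing updateItem! $se")
        }
      }
      client.shutdown()
    }
  }
}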

It is better to write to the database from a single partition and to use a singleton to initialize the connection, to keep the number of database connections down; inside the foreachPartition function, write in batches to increase the number of rows inserted per round trip.
rdd.repartition(1).foreachPartition { partition =>
  // get the singleton connection instance
  // write to db in batches
  // close connection
}
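A sketch of that pattern with plain JDBC; the URL, table, and record type here are placeholders rather than anything from the question:
import java.sql.{Connection, DriverManager}

// one lazily-initialised connection per JVM, created independently on each executor
object ConnectionHolder {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://localhost/testdb", "user", "password") // placeholder URL
}

// assuming an RDD[(String, Int)] purely for illustration
rdd.repartition(1).foreachPartition { records =>
  val conn = ConnectionHolder.connection
  val ps = conn.prepareStatement("insert into events(name, value) values (?, ?)") // hypothetical table
  records.foreach { case (name, value) =>
    ps.setString(1, name)
    ps.setInt(2, value)
    ps.addBatch() // accumulate rows instead of a round trip per insert
  }
  ps.executeBatch() // one batched write for the whole partition
  ps.close()
  // the singleton connection stays open so later batches can reuse it
}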

Related

Cassandra async write batch: want to check which Cassandra batch's data failed

I am using Cassandra batches to write to Cassandra nodes.
The batch size is 20.
List<List<SimpleStatement>> lists = Lists.partition(simpleStatementList, 20);
List<ListenableFuture<CompletionStage<AsyncResultSet>>> futures = new ArrayList<>();
try {
    lists.forEach(list -> {
        BatchStatementBuilder batchStatementBuilder =
            BatchStatement.builder(BatchType.LOGGED);
        list.forEach(batchStatementBuilder::addStatement);
        futures.add(executorService.submit(() ->
            session.executeAsync(batchStatementBuilder.build())));
    });
} catch (Exception e) {
    LOG.error("Error", e);
} finally {
    Futures.allAsList(futures).get(1, TimeUnit.HOURS);
    futures.forEach(val -> {
        try {
            val.get().whenCompleteAsync(
                (resultSet, error) -> {
                    if (error != null) {
                        LOG.info("Fail to write to Cassandra");
                    }
                });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    });
    executorService.shutdown();
}
I want to know which batches failed when I write using the async method.
Thanks in advance
Within your code, you can catch the exception returned from the asynchronous request and handle it whichever way you want. For example, log the contents of the lists object.
As a side note, CQL batches are not an optimisation. In fact, they are expensive and will perform worse than separate asynchronous statements since they put pressure on the coordinator.
Given that you are bulk-inserting into the same table, this is bad practice and is not recommended in Cassandra. CQL batches are for achieving atomicity when updating the same record in multiple tables. Cheers!
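As a sketch of that idea (shown here in Scala; session, LOG, and the driver classes are assumed to match the ones in the question, and batches stands in for the partitioned statement lists), each batch is kept next to its callback so a failure can be traced back to the exact statements:
import com.datastax.oss.driver.api.core.cql.{BatchStatement, DefaultBatchType, SimpleStatement}

// collect the statement groups whose writes failed
val failedBatches = new java.util.concurrent.ConcurrentLinkedQueue[Seq[SimpleStatement]]()

batches.zipWithIndex.foreach { case (stmts, batchIndex) => // batches: Seq[Seq[SimpleStatement]], assumed
  val builder = BatchStatement.builder(DefaultBatchType.LOGGED)
  stmts.foreach(s => builder.addStatement(s))
  session.executeAsync(builder.build()).whenComplete { (resultSet, error) =>
    if (error != null) {
      LOG.error(s"Batch $batchIndex failed", error)
      failedBatches.add(stmts) // the exact statements that were not written
    }
  }
}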

Spark Structured Streaming Redis sink performance not desirable

I've used Spark Structured Streaming to consume Kafka messages and save the data to Redis. By extending ForeachWriter[org.apache.spark.sql.Row] I implemented a Redis sink to save the data. The code runs, but only a little more than 100 records are saved to Redis per second. Is there a better way to speed up the procedure? Code like the snippet below connects to and disconnects from the Redis server on every micro-batch; is there a way to connect just once and keep the connections open, to minimize the connection cost, which I suppose is the main source of the time spent?
I tried broadcasting Jedis, but neither Jedis nor JedisPool is serializable, so it didn't work.
My sink code is below:
class StreamDataSink extends ForeachWriter[org.apache.spark.sql.Row] {

  var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    if (null == jedis) {
      jedis = FPCRedisUtils.getPool.getResource
    }
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      jedis.select(Constants.REDIS_DATABASE_INDEX)
      if (jedis.exists("counter")) {
        jedis.incr("counter")
      } else {
        jedis.set("counter", 1.toString)
      }
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (null != jedis) {
      jedis.close()
      jedis.disconnect()
    }
  }
}
Any suggestions will be appreciated.
Don't do jedis.disconnect(). That actually closes the socket, forcing a new connection next time around. Use only jedis.close(); it returns the connection to the pool.
When you call INCR on a non-existent key, it is automatically created with a default of zero and then incremented, resulting in a new key with value 1.
This simplifies your if-else to a single jedis.incr("counter").
With this you have:
jedis.select(Constants.REDIS_DATABASE_INDEX)
jedis.incr("counter")
Review if you really need the SELECT. This is per connection and all connections default to DB 0. If all workloads sharing the same jedis pool are using DB 0, there is no need to call select.
If you do need both select and incr, then pipeline them:
val pipelined = jedis.pipelined()
pipelined.select(Constants.REDIS_DATABASE_INDEX)
pipelined.incr("counter")
pipelined.sync()
This will send the two commands in one network message, further improving your performance.
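Put together, the sink from the question might look roughly like this (a sketch only, reusing FPCRedisUtils and Constants from the original code):
class StreamDataSink extends ForeachWriter[org.apache.spark.sql.Row] {

  var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = FPCRedisUtils.getPool.getResource // one pooled connection per partition/epoch
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      val pipeline = jedis.pipelined()
      pipeline.select(Constants.REDIS_DATABASE_INDEX) // drop this if everything lives in DB 0
      pipeline.incr("counter")                        // INCR creates the key if it is missing
      pipeline.sync()                                 // both commands go out in one round trip
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close() // return the connection to the pool; no disconnect()
  }
}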

Spark 2.2 Structured Streaming foreach writer JDBC sink lag

I'm on a project using Spark 2.2 Structured Streaming to read Kafka messages into an Oracle database. The message flow into Kafka is about 4000-6000 messages per second.
When using the HDFS file system as the sink destination, it just works fine. When using the foreach JDBC writer, a huge delay builds up over time. I think the lag is caused by the foreach loop.
The JDBC sink class (standalone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.executeUpdate()
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.commit()
    connection.close()
  }
}
The sink part:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "namenode:9092")
  .option("fetch.message.max.bytes", "50000000")
  .option("kafka.max.partition.fetch.bytes", "50000000")
  .option("subscribe", "rawdb.raw_data")
  .option("startingOffsets", "latest")
  .load()
  .select($"value".as[Array[Byte]])
  .map(avroDeserialize(_))
  .filter(some logic).select(some logic)
  .writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
If I change the last line
.writeStream.format("csv")...
into the JDBC foreach sink as follows:
val url = "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user"
val pwd = "password"
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag shows up.
I guess the problem is most likely caused by the foreach loop mechanics: it does not write in batches of several thousand rows at a time. As an Oracle DBA I have also fine-tuned the database side; mostly the database is waiting on idle events. Excessive commits are already avoided by setting connection.setAutoCommit(false). Any suggestion will be much appreciated.
Although I don't have an actual profile of what's taking the longest time in your application, I would assume it is because using ForeachWriter will effectively close and re-open your JDBC connection on each run; that's how ForeachWriter works.
I would advise that instead of using it, you write a custom Sink for JDBC where you control how the connection is opened and closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
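If a custom Sink feels too invasive, a lighter-weight variant is to batch inside the existing ForeachWriter itself: queue rows with addBatch() in process() and flush them once per partition in close(). The following is only a sketch of the JDBCSink from the question with that change applied, not something tested against Oracle:
class BatchedJDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "oracle.jdbc.driver.OracleDriver"
  var connection: java.sql.Connection = _
  var statement: java.sql.PreparedStatement = _
  val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = java.sql.DriverManager.getConnection(url, user, pwd)
    connection.setAutoCommit(false)
    statement = connection.prepareStatement(v_sql)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    statement.setString(1, value(0).toString)
    statement.setString(2, value(1).toString)
    statement.setString(3, value(2).toString)
    statement.setString(4, value(3).toString)
    statement.addBatch() // queue the row instead of a round trip per insert
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) {
      statement.executeBatch() // flush all queued rows for this partition in one go
      connection.commit()
    }
    connection.close()
  }
}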
Problem solved by injecting the result into another Kafka topic, then writing another program that reads from the new topic and writes to the database in batches.
I think in the next Spark release they might provide a JDBC sink with a parameter to set the batch size.
The main code is as follows.
Write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
Read the topic and write to the database; I'm using a c3p0 connection pool:
lines.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = ConnManager.getManager.getConnection
      val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          // insertIntoDB (not shown) presumably sets the parameters and calls ps.addBatch()
          insertIntoDB(ps, record)
        })
        ps.executeBatch()
        conn.commit()
      } catch {
        case e: Exception => {} // do some logging
      } finally {
        ps.close()
        conn.close()
      }
    })
  }
})
Have you tried using a trigger?
I noticed that when I didn't use a trigger, my foreach sink opened and closed the connection to the database several times.
writeStream.foreach(writer).start()
But when I used a trigger, the foreach sink only opened and closed the connection once, processing for example 200 queries, and when the micro-batch ended it closed the connection until a new micro-batch arrived.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so I think Spark is using one partition. I don't know whether this solution behaves the same with multiple Spark partitions, but my conclusion here is that foreach processes the whole micro-batch in one go (row by row) in the process method, and doesn't call open() and close() for every row as a lot of people think.

Does doing rdd.count() inside foreachRDD return results to Driver or Executor?

For the code below, does the .count() return the value back to the driver or only to the executor?
JavaPairDStream<String, String> stream = ...
stream.foreachRDD(rdd -> {
    long count = rdd.count();
    // some code to save count to the datastore
});
I know count() usually returns the value to the driver, but I'm not sure what happens when it's inside foreachRDD.
For other related questions in the future: is there an easy way to verify whether a code block executes on the driver or on an executor?
Operations that give access to an RDD, such as transform(rdd => ...) and foreachRDD(rdd => ...), execute in the context of the driver. The mind twist that gets confusing is that operations on that RDD itself will execute on the executors in the cluster.
For example:
stream.foreachRDD(rdd -> {
    long count = rdd.count(); // the count is executed on the cluster; the result is brought back to the driver, like in core Spark
    JavaRDD<Object> richer = rdd.map(elem -> something(elem)); // executes distributed on the executors
    db.store(richer.top(10)); // top(10) runs on the cluster; db.store executes in the driver
});
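As for verifying where a block runs: one common trick (an addition on my part, shown in Scala for brevity and assuming a DStream called dstream) is to check org.apache.spark.TaskContext.get(), which returns null on the driver and the current task's context on an executor:
import org.apache.spark.TaskContext

dstream.foreachRDD { rdd =>
  // TaskContext.get() is null outside of a task, so this block runs on the driver
  println(s"on the driver, TaskContext = ${TaskContext.get()}")
  rdd.foreachPartition { _ =>
    // this block runs inside a task on an executor, so a TaskContext is available
    println(s"on an executor, partition ${TaskContext.get().partitionId()}")
  }
}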

What is the correct way of using memSQL Connection object inside call method of Apache Spark code

I have Spark code where the code inside the call method makes calls to the memSQL database to read from a table. My code opens a new connection object each time and closes it after the task is done. This call is made from inside the call method. This works fine, but the execution time of the Spark job becomes high. What would be a better way to do this so that the Spark code execution time is reduced?
Thank You.
You can use one connection per partition, like this:
rdd.foreachPartition { records =>
  val connection = DB.createConnection()
  // materialize the iterator once; it can only be traversed a single time
  val batch = records.toList
  // you can use your connection instance inside foreach
  batch.foreach { r =>
    val externalData = connection.read(r.externalId)
    // do something with your data
  }
  DB.save(batch)
  connection.close()
}
If you use Spark Streaming:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = DB.createConnection()
    // materialize the iterator once; it can only be traversed a single time
    val batch = records.toList
    // you can use your connection instance inside foreach
    batch.foreach { r =>
      val externalData = connection.read(r.externalId)
      // do something with your data
    }
    DB.save(batch)
    connection.close()
  }
}
See http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
