spark-streaming and connection pool implementation - apache-spark

The spark-streaming website at https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams mentions the following code:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
I have tried to implement this using org.apache.commons.pool2 but running the application fails with the expected java.io.NotSerializableException:
15/05/26 08:06:21 ERROR OneForOneStrategy: org.apache.commons.pool2.impl.GenericObjectPool
java.io.NotSerializableException: org.apache.commons.pool2.impl.GenericObjectPool
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
...
I am wondering how realistic it is to implement a connection pool that is serializable. Has anyone succeeded in doing this?
Thank you.

To address this "local resource" problem, what's needed is a singleton object, i.e. an object that is guaranteed to be instantiated once and only once in the JVM. Luckily, a Scala object provides this functionality out of the box.
The second thing to consider is that this singleton will provide a service to all tasks running on the same JVM where it's hosted, so it MUST take care of concurrency and resource management.
Let's try to sketch(*) such a service:
import java.net.Socket
import org.apache.commons.pool2.ObjectPool

class ManagedSocket(private val pool: ObjectPool[Socket], val socket: Socket) {
  def release(): Unit = pool.returnObject(socket)
}
// singleton object: instantiated once per JVM, i.e. once per executor
object SocketPool {
  private var hostPortPool: Map[(String, Int), ObjectPool[Socket]] = Map()
  sys.addShutdownHook {
    hostPortPool.values.foreach(_.close()) // terminate each pool
  }
  // factory method; synchronized because tasks on the same JVM call it concurrently
  def apply(host: String, port: Int): ManagedSocket = synchronized {
    val pool = hostPortPool.getOrElse((host, port), {
      val p: ObjectPool[Socket] = ??? // create new pool for (host, port)
      hostPortPool += (host, port) -> p
      p
    })
    new ManagedSocket(pool, pool.borrowObject())
  }
}
Then usage becomes:
val host = ???
val port = ???
stream.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val mSocket = SocketPool(host, port)
partition.foreach{elem =>
val os = mSocket.socket.getOutputStream()
// do stuff with os + elem
}
mSocket.release()
}
}
I'm assuming that the GenericObjectPool used in the question takes care of concurrency. Otherwise, access to each pool instance needs to be guarded with some form of synchronization.
(*) code provided to illustrate the idea on how to design such object - needs additional effort to be converted into a working version.
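The "once per JVM, safe under concurrent tasks" part of the idea can be sketched with only the JDK. This is a minimal illustration, not the answer's implementation: FakePool, PoolRegistry and PoolRegistryDemo are hypothetical names, and FakePool merely stands in for a real pool such as GenericObjectPool[Socket].

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a real pool such as GenericObjectPool[Socket].
final class FakePool(val host: String, val port: Int)

// Singleton registry: one instance per JVM, hence one per Spark executor.
object PoolRegistry {
  private val pools = new ConcurrentHashMap[(String, Int), FakePool]()
  // Counts how many pools were actually constructed (for illustration only).
  val created = new AtomicInteger(0)

  // computeIfAbsent runs the factory at most once per key, even under contention.
  def poolFor(host: String, port: Int): FakePool =
    pools.computeIfAbsent((host, port), new java.util.function.Function[(String, Int), FakePool] {
      override def apply(key: (String, Int)): FakePool = {
        created.incrementAndGet()
        new FakePool(key._1, key._2)
      }
    })
}

object PoolRegistryDemo {
  // Simulates several tasks on the same executor asking for the same pool
  // and returns the number of pools that were constructed.
  def run(): Int = {
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable {
        override def run(): Unit = { PoolRegistry.poolFor("localhost", 9999); () }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    PoolRegistry.created.get()
  }
}
```

Because the registry is a Scala object, nothing has to be serialized and shipped with the closure: each executor JVM initializes its own copy on first access.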

Below answer is wrong!
I'm leaving the answer here for reference, but the answer is wrong for the following reason. socketPool is declared as a lazy val, so it will get instantiated on each first request for access. Since the SocketPool case class is not Serializable, this means that it will get instantiated within each partition. This makes the connection pool useless, because we want to keep connections across partitions and RDDs. It makes no difference whether this is implemented as a companion object or as a case class. Bottom line is: the connection pool must be Serializable, and Apache Commons Pool is not.
import java.io.PrintStream
import java.net.Socket
import org.apache.commons.pool2.{PooledObject, BasePooledObjectFactory}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream
/**
* Publish a Spark stream to a socket.
*/
class PooledSocketStreamPublisher[T](host: String, port: Int)
extends Serializable {
lazy val socketPool = SocketPool(host, port)
/**
* Publish the stream to a socket.
*/
def publishStream(stream: DStream[T], callback: (T) => String) = {
stream.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val socket = socketPool.getSocket
val out = new PrintStream(socket.getOutputStream)
partition.foreach { event =>
val text : String = callback(event)
out.println(text)
out.flush()
}
out.close()
socketPool.returnSocket(socket)
}
}
}
}
class SocketFactory(host: String, port: Int) extends BasePooledObjectFactory[Socket] {
def create(): Socket = {
new Socket(host, port)
}
def wrap(socket: Socket): PooledObject[Socket] = {
new DefaultPooledObject[Socket](socket)
}
}
case class SocketPool(host: String, port: Int) {
val socketPool = new GenericObjectPool[Socket](new SocketFactory(host, port))
def getSocket: Socket = {
socketPool.borrowObject
}
def returnSocket(socket: Socket) = {
socketPool.returnObject(socket)
}
}
which you can invoke as follows:
val socketStreamPublisher = new PooledSocketStreamPublisher[MyEvent](host = "10.10.30.101", port = 29009)
socketStreamPublisher.publishStream(myEventStream, (e: MyEvent) => Json.stringify(Json.toJson(e)))

Related

Is it possible for the database to block parallel table accesses in Scala threads?

In my Scala application, I make several threads. In each thread, I write different data from the array to the same PostgreSQL table. I noticed that some threads did not write data to the PostgreSQL table. However, there are no errors in the application logs. Is it possible for the database to block parallel table accesses? What can be the cause of this behavior?
MainApp.scala:
val postgreSQL = new PostgreSQL(configurations)
val semaphore = new Semaphore(5)
for (item <- array) {
semaphore.acquire()
val thread = new Thread(new CustomThread(postgreSQL, semaphore, item))
thread.start()
}
CustomThread.scala:
import java.util.concurrent.Semaphore
import java.util.UUID.randomUUID
import utils.PostgreSQL
class CustomThread(postgreSQL: PostgreSQL, semaphore: Semaphore, item: Item) extends Runnable {
override def run(): Unit = {
try {
// Create the unique filename.
val filename: String = randomUUID().toString
// Write to the database the filename of the item.
postgreSQL.changeItemFilename(filename, item.id)
// Change the status type of the item.
postgreSQL.changeItemStatusType(3, item.id)
} catch {
case e: Throwable =>
e.printStackTrace()
} finally {
semaphore.release()
}
}
}
PostgreSQL.scala:
package utils
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import java.util.Properties
class PostgreSQL(configurations: Map[String, String]) {
val host: String = configurations("postgresql.host")
val port: String = configurations("postgresql.port")
val user: String = configurations("postgresql.user")
val password: String = configurations("postgresql.password")
val db: String = configurations("postgresql.db")
val url: String = "jdbc:postgresql://" + host + ":" + port + "/" + db
val driver: String = "org.postgresql.Driver"
val properties = new Properties()
val connection: Connection = getConnection
var statement: PreparedStatement = _
def getConnection: Connection = {
properties.setProperty("user", user)
properties.setProperty("password", password)
var connection: Connection = null
try {
Class.forName(driver)
connection = DriverManager.getConnection(url, properties)
} catch {
case e:Exception =>
e.printStackTrace()
}
connection
}
def changeItemFilename(filename: String, id: Int): Unit = {
try {
statement = connection.prepareStatement("UPDATE REPORTS SET FILE_NAME = ? WHERE ID = ?;", ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY)
statement.setString(1, filename)
statement.setInt(2, id)
statement.execute()
} catch {
case e: Exception =>
e.printStackTrace()
}
}
}
For what it's worth, JDBC is synchronous by default: a call blocks your thread until the operation on that connection has completed. So if you try to do several things on a single connection at the same time, they are executed sequentially instead.
More on that:
https://dzone.com/articles/myth-asynchronous-jdbc
That is the first and most probable reason. The second possible reason is that the database blocks modifications to rows that are being updated by another transaction; how exactly depends on the isolation level.
https://www.sqlservercentral.com/articles/isolation-levels-in-sql-server
Last but not least, it is not necessary to use bare threads in Scala. Many libraries for concurrent/asynchronous programming have been developed, such as cats-effect, Monix and ZIO, and there are dedicated database-access libraries such as Slick or doobie.
It is usually better to use them than bare threads.
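Even without an extra library, the standard scala.concurrent.Future can replace the hand-rolled threads and semaphore. The following is a minimal sketch under stated assumptions: BoundedWrites and writeItem are hypothetical names, and writeItem only stands in for a database update. In a real application each task would presumably obtain its own connection, for example from a pool; that part is not shown.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedWrites {
  // Hypothetical stand-in for a database write; returns the id it "updated".
  def writeItem(id: Int): Int = id

  def run(ids: Seq[Int]): Seq[Int] = {
    // A pool of 5 threads caps concurrency, replacing the manual Semaphore(5).
    val executor = Executors.newFixedThreadPool(5)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
    try {
      val writes = ids.map(id => Future(writeItem(id)))
      // Future.sequence fails if any write failed, so errors are not silently lost.
      Await.result(Future.sequence(writes), 30.seconds)
    } finally executor.shutdown()
  }
}
```

Waiting on the combined future also makes sure the main program does not finish before all writes have completed, and any exception surfaces at the call site instead of only in a stack trace printed by a worker thread.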

Count written records in Spark with threads

I am using onTaskEnd Spark listener to get the number of records written into file like this:
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
var recordsWritten: Long = 0L
val rowCountListener: SparkListener = new SparkListener() {
override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
synchronized {
recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
}
}
}
def rowCountOf(proc: => Unit): Long = {
recordsWritten = 0L
spark.sparkContext.addSparkListener(rowCountListener)
try {
proc
} finally {
spark.sparkContext.removeSparkListener(rowCountListener)
}
recordsWritten
}
val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test.csv") }
println(rc)
=> 100
However trying to run multiple actions in threads this obviously breaks:
Seq(1, 2, 3).par.foreach { i =>
val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test${i}.csv") }
println(rc)
}
=> 600
=> 700
=> 750
I can have each thread declare its own variable, but the Spark context is still shared and I am unable to recognize which thread a specific SparkListenerTaskEnd event belongs to. Is there any way to make it work?
(Right, maybe I could just make it separate spark jobs. But it's just a single piece of the program, so for the sake of simplicity I would prefer to stay with threads. In the worst case I'll just execute it serially or forget about counting records...)
A bit hackish but you could use accumulators as a filtering side-effect
val acc = spark.sparkContext.longAccumulator("write count")
df.filter { _ =>
acc.add(1)
true
}.write.csv(...)
println(s"rows written ${acc.count}")

Number of threads in Akka keep increasing. What could be wrong?

Why does the thread count keep increasing?
The overall flow is like this:
Akka HTTP Server API
-> on http request, sendMessageTo DataProcessingActor
-> sendMessageTo StorageActor
-> sendMessageTo DataBaseActor
-> sendMessageTo IndexActor
This is the definition of Akka HTTP API ( in pseudo-code ):
Main {
path("input/") {
post {
dataProcessingActor forward message
}
}
}
Below are the actor definitions ( in pseudo-code ):
DataProcessingActor {
case message =>
message = parse message
storageActor ! message
}
StorageActor {
case message =>
indexActor ! message
databaseActor ! message
}
DataBaseActor {
case message =>
val c = get mongoCollection
c.store(message)
}
IndexActor {
case message =>
elasticSearch.index(message)
}
After I run this setup and send multiple HTTP requests to the "input/" HTTP endpoint, I get errors:
for( i <- 0 until 1000000) {
post("input/", someMessage+i)
}
Error:
[ERROR] [04/22/2016 13:20:54.016] [Main-akka.actor.default-dispatcher-15] [akka.tcp://Main#127.0.0.1:2558/system/IO-TCP/selectors/$a/0] Accept error: could not accept new connection
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at akka.io.TcpListener.acceptAllPending(TcpListener.scala:107)
at akka.io.TcpListener$$anonfun$bound$1.applyOrElse(TcpListener.scala:82)
at akka.actor.Actor$class.aroundReceive(Actor.scala:480)
at akka.io.TcpListener.aroundReceive(TcpListener.scala:32)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
EDIT 1
Here is the application.conf file being used:
akka {
loglevel = "INFO"
stdout-loglevel = "INFO"
logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
actor {
default-dispatcher {
throughput = 10
}
}
actor {
provider = "akka.remote.RemoteActorRefProvider"
}
remote {
enabled-transports = ["akka.remote.netty.tcp"]
netty.tcp {
hostname = "127.0.0.1"
port = 2558
}
}
}
I figured out that Elasticsearch was the problem. I am using the Java API for Elasticsearch, and sockets were leaking because of the way I was using that API. This is now resolved as described below.
Below is the Elasticsearch client service using the Java API:
trait ESClient { def getClient(): Client }
case class ElasticSearchService() extends ESClient {
def getClient(): Client = {
val client = new TransportClient().addTransportAddress(
new InetSocketTransportAddress(Config.ES_HOST, Config.ES_PORT)
)
client
}
}
This is the actor which was causing the leak:
class IndexerActor() extends Actor {
val elasticSearchSvc = new ElasticSearchService()
lazy val client = elasticSearchSvc.getClient()
override def preStart = {
// initialize index, and mappings etc.
}
def receive() = {
case message =>
// do indexing here
indexMessage(client, message)
}
}
NOTE: Every time an actor instance is created, a new connection is being made.
Every invocation of new ElasticSearchService() was creating a new connection to ElasticSearch. I moved that into a separate object as shown below, and also the actor uses this object instead:
object ES {
val elasticSearchSvc = new ElasticSearchService()
lazy val client = elasticSearchSvc.getClient()
}
class IndexerActor() extends Actor {
override def preStart = {
// initialize index, and mappings etc.
}
def receive() = {
case message =>
// do indexing here
indexMessage(ES.client, message)
}
}
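The difference between the two variants can be shown without Akka or Elasticsearch. This is only an illustrative sketch: FakeClient, PerInstanceWorker, SharedClient, SharedWorker and ClientDemo are hypothetical names, and FakeClient merely counts how often a "connection" is opened.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive client such as TransportClient.
object FakeClient { val opened = new AtomicInteger(0) }
final class FakeClient { FakeClient.opened.incrementAndGet() }

// Leaky variant: each worker instance opens its own client.
final class PerInstanceWorker { val client = new FakeClient }

// Shared variant: the object is initialized once per JVM, so one client in total.
object SharedClient { lazy val client = new FakeClient }
final class SharedWorker { def client: FakeClient = SharedClient.client }

object ClientDemo {
  // Number of clients opened while creating n per-instance workers.
  def perInstance(n: Int): Int = {
    val before = FakeClient.opened.get()
    (1 to n).foreach(_ => new PerInstanceWorker)
    FakeClient.opened.get() - before
  }

  // Number of clients opened while n workers use the shared client.
  def shared(n: Int): Int = {
    val before = FakeClient.opened.get()
    (1 to n).foreach(_ => new SharedWorker().client)
    FakeClient.opened.get() - before
  }
}
```

With the per-instance variant the number of open clients grows with the number of workers, which matches the leak described above; with the shared object it stays at one no matter how many workers exist.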

Nicifying execution context's thread pool's output for logging/debugging in Scala

Is there a nice way to rename the pool of an execution context so that it produces nicer output in logs and while debugging? Instead of ForkJoinPool-2-worker-7 (the 2 says nothing about the pool's purpose in the app), I would like something like WorkForkJoinPool-2-worker-7, without creating a new WorkForkJoinPool class for it.
Example:
object LogSample extends App {
val ex1 = ExecutionContext.global
val ex2 = ExecutionContext.fromExecutor(null:Executor) // another global ex context
val system = ActorSystem("system")
val log = Logging(system.eventStream, "my.nice.string")
Future {
log.info("1")
}(ex1)
Future {
log.info("2")
}(ex2)
Thread.sleep(1000)
// output, like this:
/*
[INFO] [09/14/2015 21:53:34.897] [ForkJoinPool-2-worker-7] [my.nice.string] 2
[INFO] [09/14/2015 21:53:34.897] [ForkJoinPool-1-worker-7] [my.nice.string] 1
*/
}
You need to implement a custom thread factory, something like this:
class CustomThreadFactory(prefix: String) extends ForkJoinPool.ForkJoinWorkerThreadFactory {
def newThread(fjp: ForkJoinPool): ForkJoinWorkerThread = {
val thread = new ForkJoinWorkerThread(fjp) {}
thread.setName(prefix + "-" + thread.getName)
thread
}
}
val threadFactory = new CustomThreadFactory("custom prefix here")
val uncaughtExceptionHandler = new UncaughtExceptionHandler {
override def uncaughtException(t: Thread, e: Throwable) = e.printStackTrace()
}
val executor = new ForkJoinPool(10, threadFactory, uncaughtExceptionHandler, true)
val ex2 = ExecutionContext.fromExecutor(executor) // another global ex context
val system = ActorSystem("system")
val log = Logging(system.eventStream, "my.nice.string")
Future {
log.info("2") //[INFO] [09/15/2015 18:22:43.728] [custom prefix here-ForkJoinPool-1-worker-29] [my.nice.string] 2
}(ex2)
Thread.sleep(1000)
OK. It seems this is not possible (particularly for the default global implementation) due to the current Scala ExecutionContext implementation.
What I could do is just copy that implementation and replace:
class DefaultThreadFactory(daemonic: Boolean) ... {
def wire[T <: Thread](thread: T): T = {
thread.setName("My" + thread.getId) // ! add this one (make 'My' to be variable)
thread.setDaemon(daemonic)
thread.setUncaughtExceptionHandler(uncaughtExceptionHandler)
thread
}...
because threadFactory there
val threadFactory = new DefaultThreadFactory(daemonic = true)
is hardcoded ...
(It seems Vladimir Petrosyan was first in showing a nicer way :) )

Scala future and its callback work in the same execution context

I call def activateReward from Akka actors, and the execution of OracleClient.rewardActivate(user) is sometimes very slow (the database is outside of my responsibility and belongs to another company).
When the database is slow, the thread pool is exhausted and cannot allocate threads to run the future.onComplete callbacks, because the callbacks and the futures run in the same execution context.
Please advise how to execute the callback code asynchronously, on threads other than those allocated for the OracleClient.rewardActivate(user) futures.
class RewardActivatorHelper {
private implicit val ec = new ExecutionContext {
val threadPool = Executors.newFixedThreadPool(1000)
def execute(runnable: Runnable) {threadPool.submit(runnable)}
def reportFailure(t: Throwable) {throw t}
}
case class FutureResult(spStart:Long, spFinish:Long)
def activateReward(msg:Msg, time:Long):Unit = {
msg.users.foreach {
user =>
val future:Future[FutureResult] = Future {
val (spStart, spFinish) = OracleClient.rewardActivate(user)
FutureResult(spStart, spFinish)
}
future.onComplete {
case Success(futureResult:FutureResult) =>
futureResult match {
case res:FutureResult => Logger.writeToLog(Logger.LogLevel.DEBUG,s"started:${res.spStart}finished:${res.spFinish}")
case _ => Logger.writeToLog(Logger.LogLevel.DEBUG, "some error")
}
case Failure(e:Throwable) => Logger.writeToLog(Logger.LogLevel.DEBUG, e.getMessage)
}
}
}
}
You can specify the execution context explicitly instead of implicitly for the onComplete callback by doing something along these lines:
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
object Example extends App {
import scala.concurrent._
private implicit val ec = new ExecutionContext {
val threadPool = Executors.newFixedThreadPool(1000)
def execute(runnable: Runnable) {threadPool.submit(runnable)}
def reportFailure(t: Throwable) {throw t}
}
val f = Future {
println("from future")
}
f.onComplete { _ =>
println("I'm done.")
}(scala.concurrent.ExecutionContext.Implicits.global)
Await.result(f, Duration.Inf)
}
This will of course not solve the underlying problem of a database not keeping up, but might be good to know anyway.
To clarify: I let the onComplete callback be handled by the standard global execution context. You might want to create a separate one.
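A dedicated context for callbacks could look roughly like the sketch below. It uses only the standard library; SeparateContexts, dbEc, callbackEc and the thread names "db-pool" and "callback-pool" are hypothetical, and the pool sizes are arbitrary. The thread names make it visible which pool ran the future body and which ran the callback.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

object SeparateContexts {
  // Names threads so that it is visible which pool ran which piece of code.
  private def named(name: String): ThreadFactory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, name)
      t.setDaemon(true) // do not keep the JVM alive because of these pools
      t
    }
  }

  // Pool for the slow, blocking calls.
  val dbEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4, named("db-pool")))
  // Separate pool for callbacks, so they are not starved by slow calls.
  val callbackEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2, named("callback-pool")))

  // Returns (thread that ran the future body, thread that ran the callback).
  def run(): (String, String) = {
    val callbackThread = Promise[String]()
    val work = Future(Thread.currentThread().getName)(dbEc)
    work.onComplete(_ => callbackThread.success(Thread.currentThread().getName))(callbackEc)
    val body = Await.result(work, 5.seconds)
    (body, Await.result(callbackThread.future, 5.seconds))
  }
}
```

Passing the context explicitly at each call site, as here, avoids accidentally picking up an implicit context that is shared with the blocking work.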
