How to configure a fine-tuned thread pool for futures?

How large is Scala's thread pool for futures?
My Scala application creates many millions of future {} blocks, and I wonder whether there is anything I can do to optimize them by configuring a thread pool.
Thank you.

This answer comes from monkjack, in a comment on the accepted answer. Since it is easy to miss, I'm reposting it here.
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
If you only need to change the thread count, use the global executor and pass the following system properties:
-Dscala.concurrent.context.numThreads=8 -Dscala.concurrent.context.maxThreads=8
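For illustration, here is a minimal, self-contained sketch combining the two approaches above; the pool size of 8 and the thread-name printout are arbitrary:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object PoolDemo extends App {
  // Either tune the global pool with -Dscala.concurrent.context.numThreads=8
  // -Dscala.concurrent.context.maxThreads=8, or bring a fixed pool into
  // implicit scope explicitly, as below.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  // Futures created while this implicit is in scope run on the fixed pool.
  Future { println(Thread.currentThread.getName) } // e.g. "pool-1-thread-1"
}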

You can specify your own ExecutionContext for your futures to run in, instead of importing the global implicit ExecutionContext.
import java.util.concurrent.Executors
import scala.concurrent._

implicit val ec = new ExecutionContext {
  // Backing pool; 1000 threads is only sensible for heavily blocking workloads
  val threadPool = Executors.newFixedThreadPool(1000)

  def execute(runnable: Runnable): Unit =
    threadPool.submit(runnable)

  // Don't swallow failures silently; at minimum, log them
  def reportFailure(t: Throwable): Unit =
    t.printStackTrace()
}

The best way to specify a thread pool for Scala futures:
implicit val ec = new ExecutionContext {
  // The pool size would typically come from configuration,
  // e.g. conf.getInt("threadpool.size"); 5 is used here for brevity
  val threadPool = Executors.newFixedThreadPool(5)

  override def execute(runnable: Runnable): Unit =
    threadPool.submit(runnable)

  override def reportFailure(cause: Throwable): Unit = ()

  // Call this when you are done, or the non-daemon pool threads
  // will keep the JVM alive
  def shutdown() = threadPool.shutdown()
}
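A short usage sketch for the context above (my addition; note that calling the structural shutdown() member needs scala.language.reflectiveCalls):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.language.reflectiveCalls // shutdown() is a refinement member of ec

val answer = Future { 21 * 2 } // scheduled on the fixed pool via the implicit ec
println(Await.result(answer, 5.seconds))
ec.shutdown() // release the pool threads so the JVM can exit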

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, ExecutionContextExecutor}

class ThreadPoolExecutionContext(val executionContext: ExecutionContext)

object ThreadPoolExecutionContext {
  val executionContextProvider: ThreadPoolExecutionContext = {
    try {
      val executionContextExecutor: ExecutionContextExecutor =
        ExecutionContext.fromExecutor(Executors.newFixedThreadPool(25))
      new ThreadPoolExecutionContext(executionContextExecutor)
    } catch {
      case exception: Exception =>
        Log.error("Failed to create thread pool", exception) // Log is the app's own logger
        throw exception
    }
  }
}
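To put the wrapper to work, pull the wrapped context into implicit scope; a short sketch, assuming the definitions above:

import scala.concurrent.Future

implicit val ec = ThreadPoolExecutionContext.executionContextProvider.executionContext

Future {
  // work scheduled on the 25-thread pool
}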

Related

run several coroutines in parallel (with return value)

I'm new to Kotlin and I'm trying to run several requests to a web API in parallel threads.
So far I have:
class HttpClient {
    private val DEFAULT_BASE_URL = "https://someapi"

    fun fetch(endPoint: String, page: Int): String {
        FuelManager.instance.basePath = DEFAULT_BASE_URL
        val (_, response, _) = endPoint.httpGet(listOf("page" to page)).response()
        return String(response.data)
    }

    fun headers(endPoint: String): Headers {
        FuelManager.instance.basePath = DEFAULT_BASE_URL
        val (_, response, _) = endPoint.httpGet(listOf("page" to 1)).response()
        return response.headers
    }
}
and the class that runs the whole process:
class Fetcher(private val page: Int) {
    suspend fun run(): String = coroutineScope {
        async {
            HttpClient().fetch(DEFAULT_ENDPOINT, page)
        }
    }.await()

    companion object {
        private const val DEFAULT_ENDPOINT = "endpoint"

        suspend fun fetchAll(): MutableList<String> {
            val totalThreads = (totalCount() / pageSize()) + 1
            return runBlocking {
                var deck: MutableList<String> = mutableListOf()
                for (i in 1..totalThreads) {
                    deck.add(Fetcher(i).run())
                }
                deck
            }
        }

        private fun pageSize(): Int {
            return HttpClient().headers(DEFAULT_ENDPOINT)["page-size"].first().toInt()
        }

        private fun totalCount(): Int {
            return HttpClient().headers(DEFAULT_ENDPOINT)["total-count"].first().toInt()
        }
    }
}
I'm looking to mirror Thread.join() from Java. Could you give me some pointers on how to improve my code to achieve that?
Also, if it's not too much to ask, could you suggest a book or example set on this subject?
Thanks for your help in advance!
A few points:
If you're going to be using coroutines in a project, you'll mostly want to expose suspending functions instead of blocking functions. I don't use Fuel, but I see it has a coroutines library with suspend-function versions of its blocking functions. Usually, suspend functions that unwrap an asynchronous result have the word "await" in them. I don't know for sure what response() does since I don't use Fuel, but if I had to guess, you can use awaitResponse() instead and then make your functions suspend functions.
Not related to coroutines, but there's almost no reason to ever use the String constructor to wrap another String, since Strings are immutable. (The only reason you would ever need to copy a String in memory like that is if you were using it in some kind of unusual collection that uses identity comparison instead of == comparison, and you need it to be treated as a different value.)
Also not related to coroutines, but HttpClient in your case should be a singleton object since it holds no state. Then you won't need to instantiate it when you use it or worry about holding a reference to one in a property.
Never use runBlocking in a suspend function. A suspend function must never block, and runBlocking creates a blocking function. The only two places runBlocking should ever appear in an application are the top-level main function of a CLI app, or an app that has both coroutines and some other thread-management library, where you need to convert suspend functions into blocking non-suspend functions so they can be used by the non-coroutine-based code.
There's no reason to immediately follow async() with await() if you aren't doing it in parallel with something else. You could just use withContext instead. If you don't need to use a specific dispatcher to call the code, which you don't if it's a suspend function, then you don't even need withContext. You can just call suspend functions directly in your coroutine.
There's no reason to use coroutineScope { } to wrap a single child coroutine. It's for running multiple child coroutines and waiting for all of them.
So, if we change HttpClient's functions into suspend functions, then Fetcher.run becomes very simple.
I also think it's kind of weird that Fetcher is a class with a single property that is only used in a one-off fashion with its only function. It would be more straightforward for Fetcher to be a singleton object and for run to take the parameter it needs. Then you won't need a companion object either, since Fetcher as an object can directly host those functions.
Finally, the part you were actually asking about: to run parallel tasks in a coroutine, use coroutineScope { } and then launch async coroutines inside it and await them. The map function is handy for doing this with something you can iterate, and then you can use awaitAll(). You can also get totalCount and pageSize in parallel.
Bringing that all together:
object HttpClient {
    private val DEFAULT_BASE_URL = "https://someapi"

    suspend fun fetch(endPoint: String, page: Int): String {
        FuelManager.instance.basePath = DEFAULT_BASE_URL
        val (_, response, _) = endPoint.httpGet(listOf("page" to page)).awaitResponse()
        return String(response.data) // response.data is a ByteArray, so it still needs decoding
    }

    suspend fun headers(endPoint: String): Headers {
        FuelManager.instance.basePath = DEFAULT_BASE_URL
        val (_, response, _) = endPoint.httpGet(listOf("page" to 1)).awaitResponse()
        return response.headers
    }
}

object Fetcher {
    suspend fun run(page: Int): String =
        HttpClient.fetch(DEFAULT_ENDPOINT, page)

    private const val DEFAULT_ENDPOINT = "endpoint"

    suspend fun fetchAll(): List<String> {
        val totalThreads = coroutineScope {
            val totalCount = async { totalCount() }
            val pageSize = async { pageSize() }
            (totalCount.await() / pageSize.await()) + 1
        }
        return coroutineScope {
            (1..totalThreads).map { i ->
                async { run(i) }
            }.awaitAll()
        }
    }

    private suspend fun pageSize(): Int {
        return HttpClient.headers(DEFAULT_ENDPOINT)["page-size"].first().toInt()
    }

    private suspend fun totalCount(): Int {
        return HttpClient.headers(DEFAULT_ENDPOINT)["total-count"].first().toInt()
    }
}
I changed MutableList to List, since it's simpler and you usually don't need a MutableList. If you really need one, you can call toMutableList() on it.

Writing tests for a Kafka consumer in a multi-threaded environment

I am trying to create a Kafka consumer in a separate thread which consumes data from a Kafka topic. For this, I have extended the ShutdownableThread abstract class and provided an implementation for the doWork method. My code looks like this:
import java.time.Duration
import java.util.Properties
import kafka.utils.ShutdownableThread
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

abstract class MyConsumer(topic: String) extends ShutdownableThread(topic) {
  val props: Properties = ???
  private val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List(topic).asJava)

  // Abstract method defining what to do with each record
  def process(value: String): Unit

  override def doWork(): Unit = {
    for (record <- consumer.poll(Duration.ofMillis(1000)).asScala)
      process(record.value())
  }
}
Now, in my test, I create a consumer providing an implementation of the process() method which just mutates a variable, and then call its start() method to start the thread.
var mutVar = "initial_value"

val consumer = new MyConsumer("test_topic") {
  override def process(value: String): Unit = mutVar = "updated_value"
}

consumer.start()
assert(mutVar === "updated_value")
The consumer does consume the message from the Kafka topic, but it does not update the variable before the test finishes, and hence the test fails. So I tried to put the main thread to sleep, but then it throws a ConcurrentModificationException with the message: KafkaConsumer is not safe for multi-threaded access.
Any idea what is wrong with my approach? Thanks in advance.
I had to put the main thread to sleep for a few seconds to allow the consumer to consume the message from the Kafka topic and store it in the mutable variable. I added Thread.sleep(5000) after starting the consumer.
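In code, the adjusted test from this answer looks roughly like the sketch below; 5000 ms is an arbitrary grace period, and a fixed sleep does make the test timing-dependent:

var mutVar = "initial_value"

val consumer = new MyConsumer("test_topic") {
  override def process(value: String): Unit = mutVar = "updated_value"
}

consumer.start()
Thread.sleep(5000) // give the consumer thread time to poll and process
assert(mutVar === "updated_value")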

Get "java.lang.NoClassDefFoundError" when running spark project with spark-submit on multiple machines

I'm a beginner at Scala/Spark and I got stuck when shipping my code to the official environment.
To be short, I can't put my SparkSession object in a class method, and I don't know why. If I do so, it works fine when I run it on a local single machine, but it throws java.lang.NoClassDefFoundError: Could not initialize class XXX when I package my code into a single jar file and run it on multiple machines using spark-submit.
For example, when I structure my code like this:
object Main {
  def main(...) {
    Task.start
  }
}

object Task {
  case class Data(name: String, ...)

  val spark = SparkSession.builder().appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile(path)
    ds.map(someMethod) // it dies here!
  }

  def loadFile(path: String) = {
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
It gives me "java.lang.NoClassDefFoundError" at each place where I use a self-defined method inside those dataset transformation functions (map, filter, etc.).
However, if I rewrite it as:
object Task {
  case class Data(name: String, ...)

  def start() {
    val spark = SparkSession.builder().appName("Task").getOrCreate()
    import spark.implicits._
    var ds = loadFile(spark, path)
    ds.map(someMethod) // it works!
  }

  def loadFile(spark: SparkSession, path: String) = {
    import spark.implicits._
    spark.read.schema(...).json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.name
  }
}
That works, but it means I need to pass the spark variable through every method that needs it, and I need to write import spark.implicits._ every time a method needs it.
I think something goes wrong when Spark tries to ship my object between nodes, but I don't know exactly what the reason is or what the correct way to write my code would be.
Thanks
No, you don't need to pass the SparkSession object and import the implicits in every method that needs them. You can make the SparkSession a variable on the object, outside any function, and use it in all the functions.
Below is a modified example of your code:
object Main {
  def main(args: Array[String]): Unit = {
    Task.start()
  }
}

object Task {
  case class Data(fname: String, lname: String)

  val spark = SparkSession.builder().master("local").appName("Task").getOrCreate()
  import spark.implicits._

  def start() {
    var ds = loadFile("person.json")
    ds.map(someMethod).show()
  }

  def loadFile(path: String): Dataset[Data] = {
    spark.read.json(path).as[Data]
  }

  def someMethod(d: Data): String = {
    d.fname
  }
}
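One hedged aside (mine, not the answerer's): on a cluster this error typically appears because ds.map(someMethod) captures the enclosing Task object in the serialized closure, and initializing Task on an executor re-runs the SparkSession construction there. A common workaround sketch is to map with a lambda that only touches the row's fields:

// Mapping with a lambda avoids dragging the Task object (and its
// SparkSession initializer) into the serialized closure.
ds.map(_.fname).show()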
Hope this helps!

Deadlocks with Play 2.5 scala, deadbolt 2.5 and MongoDB/Morphia

Basically, all I want to do is get all users from my database, which worked fine until the very moment I wanted to use Deadbolt for it.
I think the 4 threads (number of processors) of the fork-join-executor are already all in use, and then there is some kind of deadlock.
Things I tried:
Raise the number of threads the executor has; however, Play/Akka ignores my settings
Define another execution context for the futures in the controller, but this does not prevent deadlocks since more than four threads still wait on each other
Use a thread-pool-executor, but my settings are ignored
The mixed Scala/Java code:
class UserController {
  def getUserList = deadbolt.Restrict(List(Array("Admin")))() { implicit request =>
    Future {
      val users = userModel.list
      val json = Json.toJson(users)
      Ok(json.toString)
    }
  }
}
The User Model is essentially nothing more than:
public class UserModel {
    private MongoClient client = new MongoClient();
    private Morphia morphia = new Morphia();
    protected Datastore datastore = morphia.createDatastore(client, "timetracking");

    public List<User> list() {
        return datastore.find(User.class).asList();
    }

    public User findUserByName(String name) {
        User found = datastore.createQuery(User.class).field("username").equal(name).get();
        return found;
    }
}
Authorization Handler:
class AuthorizationHandler extends DeadboltHandler {
  val model = new UserModel

  override def getSubject[A](request: AuthenticatedRequest[A]): Future[Option[Subject]] =
    Future {
      blocking {
        request.subject match {
          case Some(user) =>
            request.subject
          case None =>
            val username = request.session.get("username")
            if (username.isDefined) {
              val user = model.findUserByName(username.get)
              if (user == null) {
                None
              } else {
                val subject = new ScalaSubject(user.getUsername, user.getRole)
                Some(subject)
              }
            } else {
              None
            }
        }
      }
    }
}
Defining a separate Deadbolt execution context does not help:
package deadbolt.scala

import be.objectify.deadbolt.scala.DeadboltExecutionContextProvider
import be.objectify.deadbolt.scala.cache.HandlerCache
import play.api.inject.{Binding, Module}
import play.api.{Configuration, Environment}

class DeadBoldModule extends Module {
  override def bindings(environment: Environment,
                        configuration: Configuration): Seq[Binding[_]] = Seq(
    bind[HandlerCache].to[TimeTrackerHandelCache],
    bind[DeadboltExecutionContextProvider].to[ThreadPoolProvider]
  )
}
Custom context provider:
package deadbolt.scala

import java.util.concurrent.Executors
import be.objectify.deadbolt.scala.DeadboltExecutionContextProvider
import scala.concurrent.ExecutionContext

class ThreadPoolProvider extends DeadboltExecutionContextProvider {
  override def get(): ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(100))
}
When I try this, throwing a random exception to see whether get() is even called, the exception is never thrown:
package deadbolt.scala

import java.util.concurrent.Executors
import be.objectify.deadbolt.scala.DeadboltExecutionContextProvider
import scala.concurrent.ExecutionContext

class ThreadPoolProvider extends DeadboltExecutionContextProvider {
  override def get(): ExecutionContext = {
    throw new IllegalAccessError("asd")
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(100))
  }
}
It was not Deadbolt's fault: the MongoClient opened a new thread each time it was instantiated, which happened quite often in our project, and those threads were not closed properly, thereby blocking the thread pool. We used a singleton and everything worked fine.
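A minimal sketch of the singleton fix described above (my illustration; it assumes the same Morphia setup as UserModel, and the name MongoConnection is made up):

import com.mongodb.MongoClient
import org.mongodb.morphia.Morphia

// One client, one connection pool (and its monitor threads) for the whole app,
// instead of a fresh MongoClient per UserModel instance.
object MongoConnection {
  val client = new MongoClient()
  val datastore = new Morphia().createDatastore(client, "timetracking")
}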

calling akka scheduler from callback function running on thread outside of akka

How can I call the Akka scheduler from a thread outside of the Akka system?
I have an actor that instantiates a few Java classes, which have callbacks that run on their own threads. I'd like to schedule execution from these callbacks.
Pseudocode:
class GPIOActor extends Actor with PubSubActor {

  class DebouncerListener(pin: Pin) extends GpioPinListenerDigital {
    override def stateChangeEvent(event: GpioPinDigitalStateChangeEvent) {
      if (notAlreadyWaitingForBounceToFinish)
        context.system.scheduler.scheduleOnce(debouncePollPeriod, new Runnable() {
          def run(): Unit = {
            if (stillBouncing) {
              context.system.scheduler.scheduleOnce(debouncePollPeriod, this)(context.dispatcher)
            } else self ! PinStateChangeEvent(pin.getValue)
          }
        })(context.dispatcher)
    }
  }

  val myPin = GPIOPinListener(RapiPin)
  myPin.registerCallback(new DebouncerListener(myPin))

  (...)

  def receive: Receive = {
    case PinStateChangeEvent(newValue) => notifySubscribers(newValue)
  }
}
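No answer was posted, but for what it's worth, a common pattern is to capture the scheduler, dispatcher, and self into local vals while still on the actor's thread: Akka's Scheduler and ActorRef are safe to use from arbitrary threads, whereas context itself is not. A minimal sketch under those assumptions:

import akka.actor.Actor
import scala.concurrent.duration._

case class PinStateChangeEvent(value: Any)

class GPIOActor extends Actor {
  // Captured on the actor's thread at construction time; scheduler and
  // ActorRef are safe to share with foreign callback threads, `context` is not.
  private val scheduler = context.system.scheduler
  private implicit val ec = context.dispatcher
  private val selfRef = self

  // A callback running on a non-Akka thread can then schedule like this:
  //   scheduler.scheduleOnce(50.millis) { selfRef ! PinStateChangeEvent(v) }

  def receive: Receive = {
    case PinStateChangeEvent(newValue) => // react on the actor's thread
  }
}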
