Custom synchronization using java.util.concurrent with cats.effect - multithreading

I have a requirement for some custom, non-trivial synchronization which can be implemented with a fair ReentrantLock and a Phaser. It does not seem possible (without non-trivial customization) to implement it on top of fs2 and cats.effect.
Since every blocking operation is required to be wrapped in a Blocker, here is the code:
private val l: ReentrantLock = new ReentrantLock(true)
private val c: Condition = l.newCondition
private val blocker: Blocker = //...

// F is declared at the class level
def lockedMutex(conditionPredicate: Int => Boolean): F[Unit] = blocker.blockOn {
  Sync[F].delay(l.lock()).bracket { _ =>
    Sync[F].delay {
      while (!conditionPredicate(2)) {
        c.await()
      }
    }
  }(_ => Sync[F].delay(l.unlock()))
}
QUESTION:
Is it guaranteed that the code containing c.await() will be executed on the same thread that acquires/releases the ReentrantLock?
This is crucial since, if it is not, an IllegalMonitorStateException will be thrown.

You really do not need to worry about threads when using something like cats-effect; rather, you can describe your problem at a higher level. (As for the literal question: no, there is no such guarantee in general, since a fiber may resume on a different thread after any asynchronous step.)
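A tiny sketch (my illustration, not part of the original answer) that shows the issue using cats-effect 3's IO; it may well print two different thread names:

import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object ThreadHopDemo extends IOApp.Simple {
  // The two printlns run in separate delay blocks; after the sleep the
  // fiber can resume on a different compute thread, so the names may differ.
  val run: IO[Unit] =
    IO(println(Thread.currentThread.getName)) *>
      IO.sleep(1.second) *>
      IO(println(Thread.currentThread.getName))
}

Acquiring a ReentrantLock in one of those steps and releasing it in another therefore risks exactly the IllegalMonitorStateException you are worried about.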
Instead, this should get the behavior you want: it will run high-priority jobs until there are none left, and only then pick low-priority jobs. After finishing a low-priority job, each fiber will first check whether there are more high-priority jobs before picking another low-priority one:
import cats.effect.Async
import cats.effect.std.Queue
import cats.effect.syntax.all._
import cats.syntax.all._
import scala.concurrent.ExecutionContext

object HighLowPriorityRunner {
  final case class Config[F[_]](
      highPriorityJobs: Queue[F, F[Unit]],
      lowPriorityJobs: Queue[F, F[Unit]],
      customEC: Option[ExecutionContext]
  )

  def apply[F[_]](config: Config[F])
                 (implicit F: Async[F]): F[Unit] = {
    val processOneJob =
      config.highPriorityJobs.tryTake.flatMap {
        case Some(hpJob) => hpJob
        case None => config.lowPriorityJobs.tryTake.flatMap {
          case Some(lpJob) => lpJob
          case None => F.unit
        }
      }

    val loop: F[Unit] = processOneJob.start.foreverM

    config.customEC.fold(ifEmpty = loop)(ec => loop.evalOn(ec))
  }
}
You can use the customEC to provide your own ExecutionContext to control the number of real threads that are running your fibers under the hood.
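As a side note (a variation of mine, not part of the original answer): if you would rather bound concurrency with a fixed number of fibers instead of real threads, you can run N copies of the polling loop in parallel instead of forking a fiber per job. A minimal sketch, taking the same processOneJob as a parameter:

import cats.effect.Async
import cats.effect.implicits._ // provides the Parallel instance for Async
import cats.syntax.all._

// Runs `workers` copies of the polling loop concurrently, so at most
// `workers` jobs execute at any given time.
def runWorkers[F[_]](processOneJob: F[Unit], workers: Int)(implicit F: Async[F]): F[Unit] =
  List.fill(workers)(processOneJob.foreverM.void).parSequence_

Whether you bound by threads (customEC) or by fibers is a design choice; the queue-draining logic stays the same.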
The HighLowPriorityRunner above can be used like this:
import cats.effect.{Async, IO, IOApp, Resource}
import cats.effect.std.Queue
import cats.effect.syntax.all._
import cats.syntax.all._
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

object Main extends IOApp.Simple {
  override final val run: IO[Unit] =
    Resource.make(IO(Executors.newFixedThreadPool(2)))(ec => IO.blocking(ec.shutdown())).use { ec =>
      Program[IO](ExecutionContext.fromExecutor(ec))
    }
}
object Program {
  private def createJob[F[_]](id: Int)(implicit F: Async[F]): F[Unit] =
    F.delay(println(s"Starting job ${id} on thread ${Thread.currentThread.getName}")) *>
      F.delay(Thread.sleep(1.second.toMillis)) *> // Blocks the fiber! Only for testing; use F.sleep in real code.
      F.delay(println(s"Finished job ${id}!"))

  def apply[F[_]](customEC: ExecutionContext)(implicit F: Async[F]): F[Unit] = for {
    highPriorityJobs <- Queue.unbounded[F, F[Unit]]
    lowPriorityJobs <- Queue.unbounded[F, F[Unit]]
    runnerFiber <- HighLowPriorityRunner(HighLowPriorityRunner.Config(
      highPriorityJobs,
      lowPriorityJobs,
      Some(customEC)
    )).start
    _ <- List.range(0, 10).traverse_(id => highPriorityJobs.offer(createJob(id)))
    _ <- List.range(10, 15).traverse_(id => lowPriorityJobs.offer(createJob(id)))
    _ <- F.sleep(5.seconds)
    _ <- List.range(15, 20).traverse_(id => highPriorityJobs.offer(createJob(id)))
    _ <- runnerFiber.join.void
  } yield ()
}
Which should produce an output like this:
Starting job 0 on thread pool-1-thread-1
Starting job 1 on thread pool-1-thread-2
Finished job 0!
Finished job 1!
Starting job 2 on thread pool-1-thread-1
Starting job 3 on thread pool-1-thread-2
Finished job 2!
Finished job 3!
Starting job 4 on thread pool-1-thread-1
Starting job 5 on thread pool-1-thread-2
Finished job 4!
Finished job 5!
Starting job 6 on thread pool-1-thread-1
Starting job 7 on thread pool-1-thread-2
Finished job 6!
Finished job 7!
Starting job 8 on thread pool-1-thread-1
Starting job 9 on thread pool-1-thread-2
Finished job 8!
Finished job 9!
Starting job 10 on thread pool-1-thread-1
Starting job 11 on thread pool-1-thread-2
Finished job 10!
Finished job 11!
Starting job 15 on thread pool-1-thread-1
Starting job 16 on thread pool-1-thread-2
Finished job 15!
Finished job 16!
Starting job 17 on thread pool-1-thread-1
Starting job 18 on thread pool-1-thread-2
Finished job 17!
Finished job 18!
Starting job 19 on thread pool-1-thread-1
Starting job 12 on thread pool-1-thread-2
Finished job 19!
Starting job 13 on thread pool-1-thread-1
Finished job 12!
Starting job 14 on thread pool-1-thread-2
Finished job 13!
Finished job 14!
Thanks to Gavin Bisesi (@Daenyth) for refining my original idea into this!
Full code available here.
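As a footnote, here is a minimal sketch (my illustration, not from the answer above) of how the question's predicate wait could be modeled with no ReentrantLock at all, using Ref and Deferred from cats-effect 3; the names CondVar, awaitCondition and set are made up:

import cats.effect.{Concurrent, Deferred, Ref}
import cats.syntax.all._

// Waiters block semantically (no thread is parked) until the predicate holds;
// every update wakes them up to re-check, like a condition variable.
final class CondVar[F[_]](state: Ref[F, (Int, Deferred[F, Unit])])(implicit F: Concurrent[F]) {
  def awaitCondition(p: Int => Boolean): F[Unit] =
    state.get.flatMap { case (value, signal) =>
      if (p(value)) F.unit
      else signal.get >> awaitCondition(p) // wait for the next update, then re-check
    }

  def set(value: Int): F[Unit] =
    F.deferred[Unit].flatMap { next =>
      state.modify { case (_, signal) =>
        ((value, next), signal.complete(()).void) // swap the signal, wake waiters
      }.flatten
    }
}

object CondVar {
  def apply[F[_]](initial: Int)(implicit F: Concurrent[F]): F[CondVar[F]] =
    F.deferred[Unit].flatMap(d => F.ref((initial, d)).map(new CondVar(_)))
}

No thread identity is involved anywhere, which is the point of the answer above: the runtime is free to move fibers between threads because the blocking is purely semantic.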

Related

Multiprocessing Processes Not Joining

I have this code, but the worker processes don't finish all their tasks:
from multiprocessing import Process

workers = []
for batch_index, batched_payloads in batches:
    p = Process(target=self.process_batch, args=(batch_index, batched_payloads))
    workers.append(p)
    p.start()

for worker in workers:
    worker.join()
Any help please: I have 20,000 items that I split into batches of 5,000. With multiprocessing it works in parallel, but it doesn't finish all the work; it exits before finishing. The job is to insert data; for example, each batch inserts only 100 items or so and does not finish all of them (batch 1: 10..100, batch 2: 5000..5200, batch 3: 10000..11000, batch 4: 15000..15900).

How to fire N Requests per Second in Scala

I am developing an application in Scala which is a kind of batch request processor. It gets the required data from storage, forms the request, and calls an API of another service (named ServiceB).
ServiceB allocates 110 TPS of its throughput to my service. In order to utilize the maximum available throughput, I want to make requests at a rate of 100 TPS.
How can I fire requests at a rate of exactly 100 TPS?
In theory, if the API call takes around 500 milliseconds to execute, one thread can execute at most 2 requests per second, so 50 threads are needed to achieve 100 TPS.
But Futures in Scala are executed in parallel anyway, which makes it difficult to come up with the exact number of threads required.
I tried the following
def main(args: Array[String]): Unit = {
  val executorService = Executors.newFixedThreadPool(1) // thread pool size 1
  implicit val executionContext: ExecutionContextExecutor = ExecutionContextFactory.get(executorService)
  ....
  val request = getRequest(..)
  val responseList = ListBuffer[Future[Response]]()
  val stTime = System.currentTimeMillis()
  val rateLimiter = RateLimiter.create(300) // guava RateLimiter

  while (System.currentTimeMillis() - stTime <= (1000 * 60 * 3)) { // Running for 3 mins
    rateLimiter.acquire(1)
    responseList += http(request) // using dispatch
  }
  // TODO: use a different thread pool to process the futures in responseList.
}
Metrics from DataDog:
TPS is 83
Total number of calls to that API is 16.4k
This doesn't seem to be firing calls at even 100 TPS (even though I allowed 300 requests per second; check the rate limiter value).
This problem is similar to load testing; how do load-testing frameworks fire requests at X TPS constantly?
Thanks.
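(No answer was included here, but for what it's worth: load-testing tools typically decouple the firing schedule from request execution, so slow responses never delay the next shot. A rough sketch of mine, where sendRequest is a hypothetical stand-in for the real HTTP call:)

import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{ExecutionContext, Future}

object FixedRateFirer {
  def main(args: Array[String]): Unit = {
    val ratePerSecond = 100L
    // One timer thread owns the schedule; a separate pool runs the requests,
    // so response latency cannot slow the firing rate.
    val timer = Executors.newSingleThreadScheduledExecutor()
    val requestPool = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(64))

    val fireOne: Runnable = () => { Future(sendRequest())(requestPool); () }
    timer.scheduleAtFixedRate(fireOne, 0L, 1000000L / ratePerSecond, TimeUnit.MICROSECONDS)
  }

  private def sendRequest(): Unit = ??? // hypothetical: the actual API call
}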

APScheduler add lots of jobs concurrently (database Jobstore)

How can I schedule lots of APScheduler jobs (4,000+) concurrently? (I must schedule all of these after certain user events.)
Iteratively calling add_job simply takes too long with many jobs. But when I try to use AsyncIOScheduler and the following async code, I don't get any performance increase either.
NOTE: my scheduler needs to connect to a SQL jobstore via SQLAlchemy
import asyncio
import datetime
import time

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

scheduler = AsyncIOScheduler(jobstores={"default": SQLAlchemyJobStore(url="a valid db connection str")})
scheduler.start()

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # try to schedule jobs concurrently
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(schedule_all_jobs(jobs, send_time))
    duration = time.time() - start_time
    print(f"Created {len(jobs)} jobs in {duration} seconds")

async def schedule_all_jobs(all_jobs, send_time):
    tasks = []
    for job in all_jobs:
        task = asyncio.ensure_future(schedule_job(job, send_time))
        tasks.append(task)
    await asyncio.gather(*tasks, return_exceptions=True)

async def schedule_job(job, send_time):
    # schedule a one-shot "date" job for the given time
    scheduler.add_job(send_email_if_needed, "date", run_date=send_time)
The result is very slow. How can I speed this up?
>>> schedule_jobs_quickly()
...
Created 3000 jobs in 401.9982771873474 seconds
For comparison, this is how long it took with a BackgroundScheduler() using the default memory jobstore:
Created 3000 jobs in 0.9155495166778564 seconds
So, it seems to be the database connections that are so expensive. Maybe there's a way to create multiple jobs using the same connection, instead of re-connecting for each add_job?
It's not the solution I was looking for, but I decided to give up on AsyncIOScheduler and instead schedule my many tasks in a separate thread so the rest of my program could continue without being held up by all of the DB connections. Example below.
from threading import Thread

def schedule_jobs_quickly():
    # init lots of (fake) jobs
    jobs = []
    for i in range(3000):
        jobs.append(i)
    send_time = datetime.datetime.now() + datetime.timedelta(days=2)

    # schedule jobs in a new thread
    scheduler_thread = Thread(target=schedule_email_jobs, args=(jobs, send_time))
    scheduler_thread.start()

def schedule_email_jobs(jobs, send_time):
    for job in jobs:
        scheduler.add_job(send_email, "date", run_date=send_time)

def send_email():
    pass  # sends email

Color scheme in spark web UI

What does the blue zone represent? I understand that the green zone represents the computing time. Going by the legend, the blue zone should represent scheduler delay. However, the numbers do not match, as the scheduler delay mentioned is negligible compared to the executor time. So what does it mean?
The scheduler is the part of the master that constructs the DAG of stages and tasks and interacts with the cluster to distribute them in the most efficient way it can. Scheduler Delay is the overhead of how long it takes to ship tasks to the executors and get the results back.
This is how it is calculated in the most recent branch:
private[ui] def getSchedulerDelay(
    info: TaskInfo, metrics: TaskMetricsUIData, currentTime: Long): Long = {
  if (info.finished) {
    val totalExecutionTime = info.finishTime - info.launchTime
    val executorOverhead = (metrics.executorDeserializeTime +
      metrics.resultSerializationTime)
    math.max(
      0,
      totalExecutionTime - metrics.executorRunTime - executorOverhead -
        getGettingResultTime(info, currentTime))
  } else {
    // The task is still running and the metrics like executorRunTime are not available.
    0L
  }
}
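For a concrete feel, with made-up numbers: if a task launched at t = 0 ms and finished at t = 1000 ms, with executorRunTime = 900 ms, executorDeserializeTime = 30 ms, resultSerializationTime = 20 ms and no getting-result time, the UI would report a scheduler delay of 1000 - 900 - (30 + 20) - 0 = 50 ms.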

Partial results from spark Async Interface?

Is it possible to cancel a Spark future and still get a smaller RDD with the processed elements?
Spark Async Actions "documented" here
http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.rdd.AsyncRDDActions
And the future itself has a rich set of functions
http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.FutureAction
The use case I was thinking of is a very huge map that could be aborted after 30 minutes of computation, while still being able to collect (or even iterate over, or saveAsObjectFile) the subset of the RDD that has effectively been mapped.
FutureAction.cancel causes a failure (see comment in JobWaiter.scala), so you cannot use it to get partial results. I don't think there's a way to do it through the async API.
Instead, you could stop processing the input after 30 minutes.
val stopTime = System.currentTimeMillis + 30 * 60 * 1000 // 30 minutes from now.

rdd.mapPartitions { partition =>
  if (System.currentTimeMillis < stopTime) partition.map {
    // Process it like usual.
    ???
  } else {
    // Time's up. Don't process anything.
    Iterator()
  }
}
Keep in mind that this only makes a difference once all the shuffle dependencies have completed. (It cannot stop the shuffle from being performed, even when 30 minutes have passed.)
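If you also need to tell processed elements apart from skipped ones afterwards, one variation (my sketch, with a hypothetical process function standing in for the real work) is to tag each element and filter later:

val stopTime = System.currentTimeMillis + 30 * 60 * 1000

val tagged = rdd.mapPartitions { partition =>
  if (System.currentTimeMillis < stopTime)
    partition.map(x => Some(process(x))) // processed in time
  else
    partition.map(_ => None) // time's up: mark as skipped
}

// Keep only the elements that were actually processed.
val partialResult = tagged.flatMap(_.toSeq)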
