I am using the onTaskEnd Spark listener to get the number of records written to a file, like this:
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Shared counter updated by the listener
var recordsWritten: Long = 0L

val rowCountListener: SparkListener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
    }
  }
}

// Runs proc with the listener attached and returns the records counted meanwhile
def rowCountOf(proc: => Unit): Long = {
  recordsWritten = 0L
  spark.sparkContext.addSparkListener(rowCountListener)
  try {
    proc
  } finally {
    spark.sparkContext.removeSparkListener(rowCountListener)
  }
  recordsWritten
}

val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test.csv") }
println(rc)
=> 100
However, running multiple actions in parallel threads obviously breaks this:
Seq(1, 2, 3).par.foreach { i =>
  val rc = rowCountOf { (1 to 100).toDF.write.csv(s"test${i}.csv") }
  println(rc)
}
=> 600
=> 700
=> 750
I can have each thread declare its own variable, but the Spark context is still shared and I am unable to recognize which thread a specific SparkListenerTaskEnd event belongs to. Is there any way to make it work?
(Right, maybe I could just make it separate spark jobs. But it's just a single piece of the program, so for the sake of simplicity I would prefer to stay with threads. In the worst case I'll just execute it serially or forget about counting records...)
A bit hackish, but you could use an accumulator as a filtering side effect:
val acc = spark.sparkContext.longAccumulator("write count")
df.filter { _ =>
  acc.add(1)
  true
}.write.csv(...)
println(s"rows written ${acc.count}")
I've got a Sequence (from File.walkTopDown) and I need to run a long-running operation on each of them. I'd like to use Kotlin best practices / coroutines, but I either get no parallelism, or way too much parallelism and hit a "too many open files" IO error.
File("/Users/me/Pictures/").walkTopDown()
    .onFail { file, ex -> println("ERROR: $file caused $ex") }
    .filter { ... only big images... }
    .map { file ->
        async { // I *think* I want async and not "launch"...
            ImageProcessor.fromFile(file)
        }
    }
This doesn't seem to run it in parallel, and my multi-core CPU never goes above 1 CPU's worth. Is there a way with coroutines to run "NumberOfCores parallel operations" worth of Deferred jobs?
I looked at Multithreading using Kotlin Coroutines, which first creates ALL the jobs and then joins them, but that means completing the Sequence/file tree walk completely before the heavy processing join step, and that seems... iffy! Splitting it into a collect step and a process step means the collection could run way ahead of the processing.
val jobs = ... the Sequence above...
.toSet()
println("Found ${jobs.size}")
jobs.forEach { it.await() }
This isn't specific to your problem, but it does answer the question of, "how to cap kotlin coroutines maximum concurrency".
EDIT: As of kotlinx.coroutines 1.6.0 (https://github.com/Kotlin/kotlinx.coroutines/issues/2919), you can use limitedParallelism, e.g. Dispatchers.IO.limitedParallelism(123).
Old solution: I thought to use newFixedThreadPoolContext at first, but 1) it's deprecated and 2) it would use threads and I don't think that's necessary or desirable (same with Executors.newFixedThreadPool().asCoroutineDispatcher()). This solution might have flaws I'm not aware of by using Semaphore, but it's very simple:
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

/**
 * Maps the inputs using [transform] at most [maxConcurrency] at a time until all Jobs are done.
 */
suspend fun <TInput, TOutput> Iterable<TInput>.mapConcurrently(
    maxConcurrency: Int,
    transform: suspend (TInput) -> TOutput,
) = coroutineScope {
    val gate = Semaphore(maxConcurrency)
    this@mapConcurrently.map {
        async {
            gate.withPermit {
                transform(it)
            }
        }
    }.awaitAll()
}
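For readers more at home with plain JDK concurrency, the same gate can be sketched with java.util.concurrent.Semaphore. This is only an analogue, assuming threads from a cached pool instead of coroutines; GatedMap, peak and slowDouble are hypothetical names made up for the illustration and are not part of the Kotlin solution above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class GatedMap {
    // Highest number of transforms seen running at once; kept only for the demonstration.
    static final AtomicInteger peak = new AtomicInteger();

    static <I, O> List<O> mapConcurrently(List<I> inputs, int maxConcurrency, Function<I, O> transform) {
        Semaphore gate = new Semaphore(maxConcurrency);
        AtomicInteger running = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                futures.add(pool.submit(() -> {
                    gate.acquire();                       // wait for a free permit
                    try {
                        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                        return transform.apply(input);
                    } finally {
                        running.decrementAndGet();
                        gate.release();                   // hand the permit to the next task
                    }
                }));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get());                // results keep the input order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for a slow operation such as reading an image.
    static Integer slowDouble(Integer x) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return x * 2;
    }

    public static void main(String[] args) {
        System.out.println(mapConcurrently(Arrays.asList(1, 2, 3, 4, 5, 6), 2, GatedMap::slowDouble));
        System.out.println("peak concurrency: " + peak.get());
    }
}
```

Every task is submitted up front, but only maxConcurrency of them get past the gate at any moment, which is the same trade-off the coroutine version makes.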
Tests (apologies, it uses Spek, hamcrest, and kotlin test):
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.test.TestCoroutineDispatcher
import org.hamcrest.MatcherAssert.assertThat
import org.hamcrest.Matchers.greaterThanOrEqualTo
import org.hamcrest.Matchers.lessThanOrEqualTo
import org.spekframework.spek2.Spek
import org.spekframework.spek2.style.specification.describe
import java.util.concurrent.atomic.AtomicInteger
import kotlin.test.assertEquals
@OptIn(ExperimentalCoroutinesApi::class)
object AsyncHelpersKtTest : Spek({
val actionDelay: Long = 1_000 // arbitrary; obvious if a non-test dispatcher is used by accident
val testDispatcher = TestCoroutineDispatcher()
afterEachTest {
// Clean up the TestCoroutineDispatcher to make sure no other work is running.
testDispatcher.cleanupTestCoroutines()
}
describe("mapConcurrently") {
it("should run all inputs concurrently if maxConcurrency >= size") {
val concurrentJobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = inputs.size
// https://github.com/Kotlin/kotlinx.coroutines/issues/1266 has useful info & examples
runBlocking(testDispatcher) {
print("start runBlocking $coroutineContext\n")
// We have to run this async so that the code afterwards can advance the virtual clock
val job = launch {
testDispatcher.pauseDispatcher {
val result = inputs.mapConcurrently(maxConcurrency) {
print("action $it $coroutineContext\n")
// Sanity check that we never run more in parallel than max
assertThat(concurrentJobCounter.addAndGet(1), lessThanOrEqualTo(maxConcurrency))
// Allow for virtual clock adjustment
delay(actionDelay)
// Sanity check that we never run more in parallel than max
assertThat(concurrentJobCounter.getAndAdd(-1), lessThanOrEqualTo(maxConcurrency))
print("action $it after delay $coroutineContext\n")
it
}
// Order is not guaranteed, thus a Set
assertEquals(inputs.toSet(), result.toSet())
print("end mapConcurrently $coroutineContext\n")
}
}
print("before advanceTime $coroutineContext\n")
// Start the coroutines
testDispatcher.advanceTimeBy(0)
assertEquals(inputs.size, concurrentJobCounter.get(), "All jobs should have been started")
testDispatcher.advanceTimeBy(actionDelay)
print("after advanceTime $coroutineContext\n")
assertEquals(0, concurrentJobCounter.get(), "All jobs should have finished")
job.join()
}
}
it("should run one at a time if maxConcurrency = 1") {
val concurrentJobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = 1
runBlocking(testDispatcher) {
val job = launch {
testDispatcher.pauseDispatcher {
inputs.mapConcurrently(maxConcurrency) {
assertThat(concurrentJobCounter.addAndGet(1), lessThanOrEqualTo(maxConcurrency))
delay(actionDelay)
assertThat(concurrentJobCounter.getAndAdd(-1), lessThanOrEqualTo(maxConcurrency))
it
}
}
}
testDispatcher.advanceTimeBy(0)
assertEquals(1, concurrentJobCounter.get(), "Only one job should have started")
val elapsedTime = testDispatcher.advanceUntilIdle()
print("elapsedTime=$elapsedTime")
assertThat(
"Virtual time should be at least as long as if all jobs ran sequentially",
elapsedTime,
greaterThanOrEqualTo(actionDelay * inputs.size)
)
job.join()
}
}
it("should handle cancellation") {
val jobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = 1
runBlocking(testDispatcher) {
val job = launch {
testDispatcher.pauseDispatcher {
inputs.mapConcurrently(maxConcurrency) {
jobCounter.addAndGet(1)
delay(actionDelay)
it
}
}
}
testDispatcher.advanceTimeBy(0)
assertEquals(1, jobCounter.get(), "Only one job should have started")
job.cancel()
testDispatcher.advanceUntilIdle()
assertEquals(1, jobCounter.get(), "Only one job should have run")
job.join()
}
}
}
})
Per https://play.kotlinlang.org/hands-on/Introduction%20to%20Coroutines%20and%20Channels/09_Testing, you may also need to adjust compiler args for the tests to run:
compileTestKotlin {
kotlinOptions {
// Needed for runBlocking test coroutine dispatcher?
freeCompilerArgs += "-Xuse-experimental=kotlin.Experimental"
freeCompilerArgs += "-Xopt-in=kotlin.RequiresOptIn"
}
}
testImplementation 'org.jetbrains.kotlinx:kotlinx-coroutines-test:1.4.1'
The problem with your first snippet is that it doesn't run at all: remember, a Sequence is lazy, and you have to use a terminal operation such as toSet() or forEach(). Additionally, you need to limit the number of threads that can be used for that task by constructing a newFixedThreadPoolContext and using it in async:
val pictureContext = newFixedThreadPoolContext(nThreads = 10, name = "reading pictures in parallel")
File("/Users/me/Pictures/").walkTopDown()
    .onFail { file, ex -> println("ERROR: $file caused $ex") }
    .filter { ... only big images... }
    .map { file ->
        async(pictureContext) {
            ImageProcessor.fromFile(file)
        }
    }
    .toList()
    .forEach { it.await() }
Edit:
You have to use a terminal operator (toList) before awaiting the results.
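The laziness point is not specific to Kotlin. As a rough JDK analogue (java.util.stream.Stream standing in for Sequence, and Executors.newFixedThreadPool standing in for newFixedThreadPoolContext), the sketch below shows that a mapped pipeline does nothing until a terminal operation consumes it, and that collecting the futures first is what actually submits the work. LazyPipeline and its method names are hypothetical, and squaring a number merely stands in for the image processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipeline {
    /** Builds a pipeline but never consumes it; returns how often the mapper ran. */
    static int callsWithoutTerminalOp() {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pending = Stream.of(1, 2, 3).map(x -> { calls.incrementAndGet(); return x; });
        return calls.get();
    }

    /** Same pipeline, but consumed by a terminal operation. */
    static int callsWithTerminalOp() {
        AtomicInteger calls = new AtomicInteger();
        Stream.of(1, 2, 3).map(x -> { calls.incrementAndGet(); return x; }).collect(Collectors.toList());
        return calls.get();
    }

    /** Submits one task per input to a pool of at most `threads` threads, then waits for all. */
    static List<Integer> processAll(List<Integer> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = inputs.stream()
                    .map(x -> pool.submit((Callable<Integer>) () -> x * x))
                    .collect(Collectors.toList());        // terminal operation: tasks are submitted here
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(callsWithoutTerminalOp());
        System.out.println(callsWithTerminalOp());
        System.out.println(processAll(Arrays.asList(1, 2, 3), 2));
    }
}
```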
I got it working with a Channel. But maybe I'm being redundant with your way?
val pipe = ArrayChannel<Deferred<ImageFile>>(20)
launch {
while (!(pipe.isEmpty && pipe.isClosedForSend)) {
imageFiles.add(pipe.receive().await())
}
println("pipe closed")
}
File("/Users/me/").walkTopDown()
.onFail { file, ex -> println("ERROR: $file caused $ex") }
.forEach { pipe.send(async { ImageFile.fromFile(it) }) }
pipe.close()
This doesn't preserve the order of the projection but otherwise limits the throughput to at most maxDegreeOfParallelism. Expand and extend as you see fit.
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.async
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

suspend fun <TInput, TOutput> (Collection<TInput>).inParallel(
    maxDegreeOfParallelism: Int,
    action: suspend CoroutineScope.(input: TInput) -> TOutput
): Iterable<TOutput> = coroutineScope {
    val list = this@inParallel
    if (list.isEmpty())
        return@coroutineScope listOf<TOutput>()

    val brake = Channel<Unit>(maxDegreeOfParallelism)
    val output = Channel<TOutput>()
    val counter = AtomicInteger(0)

    this.launch {
        repeat(maxDegreeOfParallelism) {
            brake.send(Unit)
        }
        for (input in list) {
            val task = this.async {
                action(input)
            }
            this.launch {
                val result = task.await()
                output.send(result)
                val completed = counter.incrementAndGet()
                if (completed == list.size) {
                    output.close()
                } else brake.send(Unit)
            }
            brake.receive()
        }
    }

    val results = mutableListOf<TOutput>()
    for (item in output) {
        results.add(item)
    }
    return@coroutineScope results
}
Example usage:
val output = listOf(1, 2, 3).inParallel(2) {
it + 1
} // Note that output may not be in same order as list.
Why not use the asFlow() operator and then use flatMapMerge?
someCoroutineScope.launch(Dispatchers.Default) {
    File("/Users/me/Pictures/").walkTopDown()
        .asFlow()
        .filter { ... only big images... }
        .flatMapMerge(concurrencyLimit) { file ->
            flow {
                emit(runInterruptible { ImageProcessor.fromFile(file) })
            }
        }.catch { ... }
        .collect()
}
Then you can limit the simultaneous open files while still processing them concurrently.
To limit the parallelism to some value, there is the limitedParallelism function, available starting from version 1.6.0 of the kotlinx.coroutines library. It can be called on a CoroutineDispatcher object. So to limit the threads used for parallel execution we can write something like:
val parallelismLimit = Runtime.getRuntime().availableProcessors()
val limitedDispatcher = Dispatchers.Default.limitedParallelism(parallelismLimit)
val scope = CoroutineScope(limitedDispatcher) // we can set limitedDispatcher for the whole scope
scope.launch { // or we can pass it to a single coroutine: launch(limitedDispatcher)
    File("/Users/me/Pictures/").walkTopDown()
        .onFail { file, ex -> println("ERROR: $file caused $ex") }
        .filter { ... only big images... }
        .map { file ->
            async {
                ImageProcessor.fromFile(file)
            }
        }.toList().awaitAll()
}
ImageProcessor.fromFile(file) will be executed in parallel using at most parallelismLimit threads.
This caps the coroutines to a fixed number of workers. I'd recommend watching https://www.youtube.com/watch?v=3WGM-_MnPQA
package com.example.workers

import kotlinx.coroutines.*
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlin.system.measureTimeMillis

class ChannellibgradleApplication

fun main(args: Array<String>) {
    val myList = mutableListOf<Int>(3000, 1200, 1400, 3000, 1200, 1400, 3000)
    runBlocking {
        // Producer: feeds the work items into a channel
        val myChannel = produce(CoroutineName("MyInts")) {
            myList.forEach { send(it) }
        }
        println("Starting coroutineScope ")
        val time = measureTimeMillis {
            coroutineScope {
                // Only this many coroutines ever pull work from the channel
                val workers = 2
                repeat(workers) {
                    launch(CoroutineName("Sleep 1")) { theHardWork(myChannel) }
                }
            }
        }
        println("Ending coroutineScope $time ms")
    }
}

suspend fun theHardWork(channel: ReceiveChannel<Int>) {
    for (m in channel) {
        println("Starting Sleep $m")
        delay(m.toLong())
        println("Ending Sleep $m")
    }
}
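The worker idea carries over to plain threads as well. The sketch below is a JDK analogue under stated assumptions, not a translation of the coroutine code: a LinkedBlockingQueue is filled up front (instead of a producer coroutine), and a fixed number of threads drain it until it is empty. WorkerPool and drain are hypothetical names, and the delays are scaled down so the example finishes quickly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class WorkerPool {
    /** Lets `workers` threads drain the queue of sleep durations; returns how many items were processed. */
    static int drain(List<Integer> delaysMs, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(delaysMs);
        AtomicInteger processed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            Thread t = new Thread(() -> {
                Integer delay;
                // poll() returns null once the queue is empty, which ends this worker
                while ((delay = queue.poll()) != null) {
                    try {
                        Thread.sleep(delay);              // stand-in for the hard work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    processed.incrementAndGet();
                }
            }, "worker-" + w);
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(drain(Arrays.asList(30, 12, 14, 30, 12, 14, 30), 2));
    }
}
```

Because only `workers` threads ever take from the queue, at most that many items are in progress at once, no matter how many are queued.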
I'm trying to embed a ZMQ subscriber in a Runnable.
I'm able to start the Runnable for the first time and everything seems okay.
The problem is when I interrupt the Thread and try to start a new Thread, the subscriber does not get any messages. For example:
I have a publisher runnable
class ZMQPublisherRunnable() extends Runnable {
override def run() {
val ZMQcontext = ZMQ.context(1)
val publisher = ZMQcontext.socket(ZMQ.PUB)
var count = 0
publisher.connect(s"tcp://127.0.0.1:16666")
while (!Thread.currentThread().isInterrupted) {
try {
println(s"PUBLISHER -> $count")
publisher.send(s"PUBLISHER -> $count")
count += 1
Thread.sleep(1000)
}
catch {
case e: Exception =>
println(e.getMessage)
publisher.disconnect(s"tcp://127.0.0.1:16666")
ZMQcontext.close()
}
}
}
}
I have a Subscriber Runnable:
class ZMQSubscriberRunnable1() extends Runnable {
override def run() {
println("STARTING SUBSCRIBER")
val ZMQcontext = ZMQ.context(1)
val subscriber = ZMQcontext.socket(ZMQ.SUB)
subscriber.subscribe("".getBytes)
subscriber.bind(s"tcp://127.0.0.1:16666")
while (!Thread.currentThread().isInterrupted) {
try {
println("waiting")
val mesg = new String(subscriber.recv(0))
println(s"SUBSCRIBER -> $mesg")
}
catch {
case e: Exception =>
println(e.getMessage)
subscriber.unbind("tcp://127.0.0.1:16666")
subscriber.close()
ZMQcontext.close()
}
}
}
}
My main code looks like this:
object Application extends App {
val zmqPUB = new ZMQPublisherRunnable
val zmqThreadPUB = new Thread(zmqPUB, "MY_PUB")
zmqThreadPUB.setDaemon(true)
zmqThreadPUB.start()
val zmqRunnable = new ZMQSubscriberRunnable1
val zmqThread = new Thread(zmqRunnable, "MY_TEST")
zmqThread.setDaemon(true)
zmqThread.start()
Thread.sleep(10000)
zmqThread.interrupt()
zmqThread.join()
Thread.sleep(2000)
val zmqRunnable_2 = new ZMQSubscriberRunnable1
val zmqThread_2 = new Thread(zmqRunnable_2, "MY_TEST_2")
zmqThread_2.setDaemon(true)
zmqThread_2.start()
Thread.sleep(10000)
zmqThread_2.interrupt()
zmqThread_2.join()
}
The first time I start the Subscriber, I'm able to receive all messages:
STARTING SUBSCRIBER
PUBLISHER -> 0
waiting
PUBLISHER -> 1
SUBSCRIBER -> PUBLISHER -> 1
waiting
PUBLISHER -> 2
SUBSCRIBER -> PUBLISHER -> 2
waiting
PUBLISHER -> 3
SUBSCRIBER -> PUBLISHER -> 3
waiting
...
Once I interrupt the Thread and start a new one from the same Runnable, I'm not able to read messages anymore. It is waiting forever
STARTING SUBSCRIBER
waiting
PUBLISHER -> 13
PUBLISHER -> 14
PUBLISHER -> 15
PUBLISHER -> 16
PUBLISHER -> 17
...
Any insights about what I'm doing wrong?
Thanks
JeroMQ is not Thread.interrupt safe.
To work around it, you have to terminate the ZMQ context before you call Thread.interrupt:
1. Instantiate the ZMQ context outside the Runnable.
2. Pass the ZMQ context as an argument to the ZMQ Runnable (you can also use it as a global variable).
3. Call zmqContext.term().
4. Call zmqSubThread.interrupt().
5. Call zmqSubThread.join().
For more details, take a look at: https://github.com/zeromq/jeromq/issues/116
My subscriber Runnable looks like:
class ZMQSubscriberRunnable(zmqContext: ZMQ.Context, port: Int, ip: String, topic: String) extends Runnable {
  override def run(): Unit = {
    var contextTerminated = false
    val subscriber = zmqContext.socket(ZMQ.SUB)
    subscriber.subscribe(topic.getBytes)
    subscriber.bind(s"tcp://$ip:$port")
    while (!contextTerminated && !Thread.currentThread().isInterrupted) {
      try {
        println(new String(subscriber.recv(0)))
      }
      catch {
        // The context was terminated from outside: stop looping and close the socket
        case e: ZMQException if e.getErrorCode == ZMQ.Error.ETERM.getCode =>
          contextTerminated = true
          subscriber.close()
        case e: Exception =>
          zmqContext.term()
          subscriber.close()
      }
    }
  }
}
To interrupt the Thread:
zmqContext.term()
zmqSubThread.interrupt()
zmqSubThread.join()
For example
import scala.actors.Actor
import scala.actors.Actor._
object Main {
class Pong extends Actor {
def act() {
var pongCount = 0
while (true) {
receive {
case "Ping" =>
if (pongCount % 1000 == 0)
Console.println("Pong: ping "+pongCount)
sender ! "Pong"
pongCount = pongCount + 1
case "Stop" =>
Console.println("Pong: stop")
exit()
}
}
}
}
class Ping(count: Int, pong: Actor) extends Actor {
def act() {
var pingsLeft = count - 1
pong ! "Ping"
while (true) {
receive {
case "Pong" =>
if (pingsLeft % 1000 == 0)
Console.println("Ping: pong")
if (pingsLeft > 0) {
pong ! "Ping"
pingsLeft -= 1
} else {
Console.println("Ping: stop")
pong ! "Stop"
exit()
}
}
}
}
}
def main(args: Array[String]): Unit = {
val pong = new Pong
val ping = new Ping(100000, pong)
ping.start
pong.start
println("???")
}
}
I am trying to print "???" after the two actors call exit(), but right now it is printed before "Ping: stop" and "Pong: stop".
I have tried keeping a flag in the actor that is false while the actor is running and true once it stops, with a loop in the main function such as while (actor.flag == false) {}, but it doesn't work: the loop never ends :-/
So, please give me some advice.
If you need synchronous calls in Akka, use the ask pattern, like:
Await.result(ping ? "ping", timeout.duration)
Also, you'd better use an actor system to create actors.
import akka.actor.{ActorRef, Props, Actor, ActorSystem}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object Test extends App {
  implicit val timeout = Timeout(3.seconds)
  val system = ActorSystem("ActorSystem")

  class Pong extends Actor {
    def receive: Receive = {
      case "Ping" =>
        println("ping")
        // Reply to the asker; without a reply the ask future only ends with a timeout
        sender ! "Pong"
        context.stop(self)
    }
  }

  lazy val pong = system.actorOf(Props(new Pong), "Pong")
  val x = pong.ask("Ping")
  // Blocks until the reply arrives (or the timeout expires)
  val res = Await.result(x, timeout.duration)
  println("????")
  system.shutdown()
}
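The core of the ask pattern is "send a message, then block on a future for the reply". Stripped of Akka, that idea can be sketched with a JDK CompletableFuture; this is an illustration of the concept only, not Akka code, and AskSketch and askAndWait are hypothetical names. A plain thread plays the role of the Pong actor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AskSketch {
    /** Sends `message` to a worker thread and blocks until it replies or the timeout expires. */
    static String askAndWait(String message, long timeoutMillis) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        // Stand-in for the actor: handles exactly one message and completes the reply future
        Thread worker = new Thread(() -> {
            if ("Ping".equals(message)) {
                reply.complete("Pong");
            } else {
                reply.completeExceptionally(new IllegalArgumentException("unexpected message: " + message));
            }
        });
        worker.start();
        try {
            // Counterpart of Await.result(future, timeout): the caller waits here
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(askAndWait("Ping", 3000));
        System.out.println("???");   // reached only after the reply has arrived
    }
}
```

Note that the waiting only ends normally if the receiver actually replies; otherwise the caller sits until the timeout and gets an exception.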
I want to implement something like the producer-consumer problem (with only one message transmitted at a time), but I want the producer to wait for someone to take its message before leaving.
Here is an example that doesn't block the producer but works otherwise.
class Channel[T]
{
private var _msg : Option[T] = None
def put(msg : T) : Unit =
{
this.synchronized
{
waitFor(_msg == None)
_msg = Some(msg)
notifyAll
}
}
def get() : T =
{
this.synchronized
{
waitFor(_msg != None)
val ret = _msg.get
_msg = None
notifyAll
return ret
}
}
private def waitFor(b : => Boolean) =
while(!b) wait
}
How can I change it so the producer gets blocked (as the consumer is)?
I tried to add another waitFor at the end of put, but sometimes my producer doesn't get released.
For instance, if I have put ; get || get ; put, most of the time it works, but sometimes the first put never terminates and the left thread never even runs the get method (I print something once the put call has terminated, and in this case it never gets printed).
This is why you should use a standard class, SynchronousQueue in this case.
If you really want to work through your problematic code, start by giving us a failing test case or a stack trace from when the put is blocking.
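As a minimal sketch of what SynchronousQueue gives you here: put() does not return until another thread has taken the element, which is exactly the blocking producer asked for. HandoffDemo and producerWaitsForConsumer are hypothetical names, and the 200 ms sleep is only there to give the producer time to reach put().

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicBoolean;

final class HandoffDemo {
    /** Returns true if put() was still blocked before take() and had returned after it. */
    static boolean producerWaitsForConsumer() {
        SynchronousQueue<String> channel = new SynchronousQueue<>();
        AtomicBoolean putReturned = new AtomicBoolean(false);
        Thread producer = new Thread(() -> {
            try {
                channel.put("hello");                     // blocks until a consumer takes the element
                putReturned.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            Thread.sleep(200);                            // let the producer reach put()
            boolean blockedBeforeTake = !putReturned.get();
            String message = channel.take();              // the hand-off releases the producer
            producer.join();
            return blockedBeforeTake && putReturned.get() && "hello".equals(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(producerWaitsForConsumer());
    }
}
```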
You can do this by means of a BlockingQueue descendant whose producer put() method creates a semaphore/event object that is queued up with the passed message, and then the producer thread waits on it.
The consumer get() method extracts a message from the queue and signals its semaphore, allowing its original producer to run on.
This gives you a 'synchronous queue' with actual queueing functionality, should that be what you want.
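A minimal sketch of that idea, with two assumptions of mine: it wraps a LinkedBlockingQueue rather than subclassing BlockingQueue (simpler to show), and it uses a Semaphore with zero permits as the per-message event. AckQueue, Entry and demo are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class AckQueue<T> {
    // A message travels together with the semaphore its producer waits on.
    private static final class Entry<M> {
        final M message;
        final Semaphore taken = new Semaphore(0);
        Entry(M message) { this.message = message; }
    }

    private final BlockingQueue<Entry<T>> queue = new LinkedBlockingQueue<>();

    /** Enqueues the message and blocks until some consumer has taken it. */
    void put(T message) throws InterruptedException {
        Entry<T> entry = new Entry<>(message);
        queue.put(entry);
        entry.taken.acquire();
    }

    /** Takes the next message and releases the producer that queued it. */
    T get() throws InterruptedException {
        Entry<T> entry = queue.take();
        entry.taken.release();
        return entry.message;
    }

    /** Two producers queue up; both stay blocked until their messages are consumed. */
    static boolean demo() {
        AckQueue<Integer> q = new AckQueue<>();
        AtomicInteger returned = new AtomicInteger();
        Thread[] producers = new Thread[2];
        for (int i = 0; i < producers.length; i++) {
            final int message = i;
            producers[i] = new Thread(() -> {
                try {
                    q.put(message);
                    returned.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producers[i].start();
        }
        try {
            Thread.sleep(200);                            // let both producers block in put()
            boolean allBlocked = returned.get() == 0;
            int sum = q.get() + q.get();                  // messages 0 and 1, in either order
            for (Thread p : producers) p.join();
            return allBlocked && returned.get() == 2 && sum == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Unlike SynchronousQueue, several producers can have messages waiting in the queue at once, yet each one only resumes when its own message has been picked up.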
I came up with something that appears to be working.
class Channel[T]
{
class Transfer[A]
{
protected var _msg : Option[A] = None
def msg_=(a : A) = _msg = Some(a)
def msg : A =
{
// Reading the message destroys it
val ret = _msg.get
_msg = None
return ret
}
def isEmpty = _msg == None
def notEmpty = !isEmpty
}
object Transfer {
def apply[A](msg : A) : Transfer[A] =
{
var t = new Transfer[A]()
t.msg = msg
return t
}
}
// Hacky but Transfer has to be invariant
object Idle extends Transfer[T]
protected var offer : Transfer[T] = Idle
protected var request : Transfer[T] = Idle
def put(msg : T) : Unit =
{
this.synchronized
{
// push an offer as soon as possible
waitFor(offer == Idle)
offer = Transfer(msg)
// request the transfer
requestTransfer
// wait for the transfer to go (ie the msg to be absorbed)
waitFor(offer isEmpty)
// delete the completed offer
offer = Idle
notifyAll
}
}
def get() : T =
{
this.synchronized
{
// push a request as soon as possible
waitFor(request == Idle)
request = new Transfer()
// request the transfer
requestTransfer
// wait for the transfer to go (ie the msg to be delivered)
waitFor(request notEmpty)
val ret = request.msg
// delete the completed request
request = Idle
notifyAll
return ret
}
}
protected def requestTransfer()
{
this.synchronized
{
if(offer != Idle && request != Idle)
{
request.msg = offer.msg
notifyAll
}
}
}
protected def waitFor(b : => Boolean) =
while(!b) wait
}
It has the advantage of respecting symmetry between producer and consumer but it is a bit longer than what I had before.
Thanks for your help.
Edit: It is better but still not safe…