Spark Streaming: Exception thrown while writing record: BatchAllocationEvent

I shut down a Spark StreamingContext with the code below.
Essentially, a thread watches a boolean flag and then calls StreamingContext.stop(true, true).
Everything seems to be processed and all my data appears to have been collected. However, I get the following exception on shutdown.
Can I ignore it? It looks like there is potential for data loss.
18/03/07 11:46:40 WARN ReceivedBlockTracker: Exception thrown while
writing record: BatchAllocationEvent(1520452000000
ms,AllocatedBlocks(Map(0 -> ArrayBuffer()))) to the WriteAheadLog.
java.lang.IllegalStateException: close() was called on
BatchedWriteAheadLog before write request with time 1520452000001
could be fulfilled.
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:86)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
The Thread
var stopScc=false
private def stopSccThread(): Unit = {
val thread = new Thread {
override def run {
var continueRun=true
while (continueRun) {
logger.debug("Checking status")
if (stopScc == true) {
getSparkStreamingContext(fieldVariables).stop(true, true)
logger.info("Called Stop on Streaming Context")
continueRun=false
}
Thread.sleep(50)
}
}
}
thread.start
}
The Stream
@throws(classOf[IKodaMLException])
def startStream(ip: String, port: Int): Unit = {
try {
val ssc = getSparkStreamingContext(fieldVariables)
ssc.checkpoint("./ikoda/cp")
val lines = ssc.socketTextStream(ip, port, StorageLevel.MEMORY_AND_DISK_SER)
lines.print
val lmap = lines.map {
l =>
if (l.contains("IKODA_END_STREAM")) {
stopScc = true
}
l
}
lmap.foreachRDD {
r =>
if (r.count() > 0) {
logger.info(s"RECEIVED: ${r.toString()} first: ${r.first().toString}")
r.saveAsTextFile("./ikoda/test/test")
}
else {
logger.info("Empty RDD. No data received")
}
}
ssc.start()
ssc.awaitTermination()
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new IKodaMLException(e.getMessage, e)
}
}

I had the same issue and calling close() instead of stop fixed it.

Related

Get return value from thread, is this Kotlin code thread safe?

I would like to run some threads, wait until all of them have finished, and get the results.
Possible way to do that would be in the code below. Is it thread safe though?
import kotlin.concurrent.thread
sealed class Errorneous<R>
data class Success<R>(val result: R) : Errorneous<R>()
data class Fail<R>(val error: Exception) : Errorneous<R>()
fun <R> thread_with_result(fn: () -> R): (() -> Errorneous<R>) {
var r: Errorneous<R>? = null
val t = thread {
r = try { Success(fn()) } catch (e: Exception) { Fail(e) }
}
return {
t.join()
r!!
}
}
fun main() {
val tasks = listOf({ 2 * 2 }, { 3 * 3 })
val results = tasks
.map{ thread_with_result(it) }
.map{ it() }
println(results)
}
P.S.
Are there better built-in tools in Kotlin to do that? Like process 10000 tasks with pool of 10 threads?
It should be threads, not coroutines, as it will be used with legacy code and I don't know if it works well with coroutines.

It seems Java's Executors do exactly that:
import java.util.concurrent.Callable
import java.util.concurrent.Executors
fun <R> execute_in_parallel(tasks: List<() -> R>, threads: Int): List<Errorneous<R>> {
val executor = Executors.newFixedThreadPool(threads)
try {
// invokeAll blocks until every task has completed
val fresults = executor.invokeAll(tasks.map { task ->
Callable<Errorneous<R>> {
try { Success(task()) } catch (e: Exception) { Fail(e) }
}
})
return fresults.map { future -> future.get() }
} finally {
executor.shutdown() // release the pool's threads
}
}

how to cap kotlin coroutines maximum concurrency

I've got a Sequence (from File.walkTopDown) and I need to run a long-running operation on each of them. I'd like to use Kotlin best practices / coroutines, but I either get no parallelism, or way too much parallelism and hit a "too many open files" IO error.
File("/Users/me/Pictures/").walkTopDown()
.onFail { file, ex -> println("ERROR: $file caused $ex") }
.filter { ... only big images... }
.map { file ->
async { // I *think* I want async and not "launch"...
ImageProcessor.fromFile(file)
}
}
This doesn't seem to run it in parallel, and my multi-core CPU never goes above 1 CPU's worth. Is there a way with coroutines to run "NumberOfCores parallel operations" worth of Deferred jobs?
I looked at Multithreading using Kotlin Coroutines, which first creates ALL the jobs and then joins them, but that means completing the Sequence/file tree walk completely before the heavy processing join step, and that seems... iffy! Splitting it into a collect step and a process step means the collection could run way ahead of the processing.
val jobs = ... the Sequence above...
.toSet()
println("Found ${jobs.size}")
jobs.forEach { it.await() }

This isn't specific to your problem, but it does answer the question of, "how to cap kotlin coroutines maximum concurrency".
EDIT: As of kotlinx.coroutines 1.6.0 (https://github.com/Kotlin/kotlinx.coroutines/issues/2919), you can use limitedParallelism, e.g. Dispatchers.IO.limitedParallelism(123).
Old solution: I thought to use newFixedThreadPoolContext at first, but 1) it's deprecated and 2) it would use threads and I don't think that's necessary or desirable (same with Executors.newFixedThreadPool().asCoroutineDispatcher()). This solution might have flaws I'm not aware of by using Semaphore, but it's very simple:
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit
/**
* Maps the inputs using [transform] at most [maxConcurrency] at a time until all Jobs are done.
*/
suspend fun <TInput, TOutput> Iterable<TInput>.mapConcurrently(
maxConcurrency: Int,
transform: suspend (TInput) -> TOutput,
) = coroutineScope {
val gate = Semaphore(maxConcurrency)
this@mapConcurrently.map {
async {
gate.withPermit {
transform(it)
}
}
}.awaitAll()
}
Tests (apologies, it uses Spek, hamcrest, and kotlin test):
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.test.TestCoroutineDispatcher
import org.hamcrest.MatcherAssert.assertThat
import org.hamcrest.Matchers.greaterThanOrEqualTo
import org.hamcrest.Matchers.lessThanOrEqualTo
import org.spekframework.spek2.Spek
import org.spekframework.spek2.style.specification.describe
import java.util.concurrent.atomic.AtomicInteger
import kotlin.test.assertEquals
@OptIn(ExperimentalCoroutinesApi::class)
object AsyncHelpersKtTest : Spek({
val actionDelay: Long = 1_000 // arbitrary; obvious if a non-test dispatcher is used by accident
val testDispatcher = TestCoroutineDispatcher()
afterEachTest {
// Clean up the TestCoroutineDispatcher to make sure no other work is running.
testDispatcher.cleanupTestCoroutines()
}
describe("mapConcurrently") {
it("should run all inputs concurrently if maxConcurrency >= size") {
val concurrentJobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = inputs.size
// https://github.com/Kotlin/kotlinx.coroutines/issues/1266 has useful info & examples
runBlocking(testDispatcher) {
print("start runBlocking $coroutineContext\n")
// We have to run this async so that the code afterwards can advance the virtual clock
val job = launch {
testDispatcher.pauseDispatcher {
val result = inputs.mapConcurrently(maxConcurrency) {
print("action $it $coroutineContext\n")
// Sanity check that we never run more in parallel than max
assertThat(concurrentJobCounter.addAndGet(1), lessThanOrEqualTo(maxConcurrency))
// Allow for virtual clock adjustment
delay(actionDelay)
// Sanity check that we never run more in parallel than max
assertThat(concurrentJobCounter.getAndAdd(-1), lessThanOrEqualTo(maxConcurrency))
print("action $it after delay $coroutineContext\n")
it
}
// Order is not guaranteed, thus a Set
assertEquals(inputs.toSet(), result.toSet())
print("end mapConcurrently $coroutineContext\n")
}
}
print("before advanceTime $coroutineContext\n")
// Start the coroutines
testDispatcher.advanceTimeBy(0)
assertEquals(inputs.size, concurrentJobCounter.get(), "All jobs should have been started")
testDispatcher.advanceTimeBy(actionDelay)
print("after advanceTime $coroutineContext\n")
assertEquals(0, concurrentJobCounter.get(), "All jobs should have finished")
job.join()
}
}
it("should run one at a time if maxConcurrency = 1") {
val concurrentJobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = 1
runBlocking(testDispatcher) {
val job = launch {
testDispatcher.pauseDispatcher {
inputs.mapConcurrently(maxConcurrency) {
assertThat(concurrentJobCounter.addAndGet(1), lessThanOrEqualTo(maxConcurrency))
delay(actionDelay)
assertThat(concurrentJobCounter.getAndAdd(-1), lessThanOrEqualTo(maxConcurrency))
it
}
}
}
testDispatcher.advanceTimeBy(0)
assertEquals(1, concurrentJobCounter.get(), "Only one job should have started")
val elapsedTime = testDispatcher.advanceUntilIdle()
print("elapsedTime=$elapsedTime")
assertThat(
"Virtual time should be at least as long as if all jobs ran sequentially",
elapsedTime,
greaterThanOrEqualTo(actionDelay * inputs.size)
)
job.join()
}
}
it("should handle cancellation") {
val jobCounter = AtomicInteger(0)
val inputs = IntRange(1, 2).toList()
val maxConcurrency = 1
runBlocking(testDispatcher) {
val job = launch {
testDispatcher.pauseDispatcher {
inputs.mapConcurrently(maxConcurrency) {
jobCounter.addAndGet(1)
delay(actionDelay)
it
}
}
}
testDispatcher.advanceTimeBy(0)
assertEquals(1, jobCounter.get(), "Only one job should have started")
job.cancel()
testDispatcher.advanceUntilIdle()
assertEquals(1, jobCounter.get(), "Only one job should have run")
job.join()
}
}
}
})
Per https://play.kotlinlang.org/hands-on/Introduction%20to%20Coroutines%20and%20Channels/09_Testing, you may also need to adjust compiler args for the tests to run:
compileTestKotlin {
kotlinOptions {
// Needed for runBlocking test coroutine dispatcher?
freeCompilerArgs += "-Xuse-experimental=kotlin.Experimental"
freeCompilerArgs += "-Xopt-in=kotlin.RequiresOptIn"
}
}
testImplementation 'org.jetbrains.kotlinx:kotlinx-coroutines-test:1.4.1'

The problem with your first snippet is that it doesn't run at all - remember, Sequence is lazy, and you have to use a terminal operation such as toSet() or forEach(). Additionally, you need to limit the number of threads that can be used for that task via constructing a newFixedThreadPoolContext context and using it in async:
val pictureContext = newFixedThreadPoolContext(nThreads = 10, name = "reading pictures in parallel")
File("/Users/me/Pictures/").walkTopDown()
.onFail { file, ex -> println("ERROR: $file caused $ex") }
.filter { ... only big images... }
.map { file ->
async(pictureContext) {
ImageProcessor.fromFile(file)
}
}
.toList()
.forEach { it.await() }
Edit:
You have to use a terminal operator (toList) before awaiting the results.

I got it working with a Channel. But maybe I'm being redundant with your way?
val pipe = ArrayChannel<Deferred<ImageFile>>(20)
launch {
while (!(pipe.isEmpty && pipe.isClosedForSend)) {
imageFiles.add(pipe.receive().await())
}
println("pipe closed")
}
File("/Users/me/").walkTopDown()
.onFail { file, ex -> println("ERROR: $file caused $ex") }
.forEach { pipe.send(async { ImageFile.fromFile(it) }) }
pipe.close()

This doesn't preserve the order of the projection but otherwise limits the throughput to at most maxDegreeOfParallelism. Expand and extend as you see fit.
suspend fun <TInput, TOutput> (Collection<TInput>).inParallel(
maxDegreeOfParallelism: Int,
action: suspend CoroutineScope.(input: TInput) -> TOutput
): Iterable<TOutput> = coroutineScope {
val list = this@inParallel
if (list.isEmpty())
return@coroutineScope listOf<TOutput>()
val brake = Channel<Unit>(maxDegreeOfParallelism)
val output = Channel<TOutput>()
val counter = AtomicInteger(0)
this.launch {
repeat(maxDegreeOfParallelism) {
brake.send(Unit)
}
for (input in list) {
val task = this.async {
action(input)
}
this.launch {
val result = task.await()
output.send(result)
val completed = counter.incrementAndGet()
if (completed == list.size) {
output.close()
} else brake.send(Unit)
}
brake.receive()
}
}
val results = mutableListOf<TOutput>()
for (item in output) {
results.add(item)
}
return@coroutineScope results
}
Example usage:
val output = listOf(1, 2, 3).inParallel(2) {
it + 1
} // Note that output may not be in same order as list.

Why not use the asFlow() operator and then use flatMapMerge?
someCoroutineScope.launch(Dispatchers.Default) {
File("/Users/me/Pictures/").walkTopDown()
.asFlow()
.filter { ... only big images... }
.flatMapMerge(concurrencyLimit) { file ->
flow {
emit(runInterruptible { ImageProcessor.fromFile(file) })
}
}.catch { ... }
.collect()
}
Then you can limit the simultaneous open files while still processing them concurrently.

To limit parallelism to some value, kotlinx.coroutines provides the limitedParallelism function starting from version 1.6.0. It can be called on a CoroutineDispatcher object. So to limit the threads used for parallel execution we can write something like:
val parallelismLimit = Runtime.getRuntime().availableProcessors()
val limitedDispatcher = Dispatchers.Default.limitedParallelism(parallelismLimit)
val scope = CoroutineScope(limitedDispatcher) // we can set limitedDispatcher for the whole scope
scope.launch { // or we can set limitedDispatcher for a coroutine launch(limitedDispatcher)
File("/Users/me/Pictures/").walkTopDown()
.onFail { file, ex -> println("ERROR: $file caused $ex") }
.filter { ... only big images... }
.map { file ->
async {
ImageProcessor.fromFile(file)
}
}.toList().awaitAll()
}
ImageProcessor.fromFile(file) will be executed in parallel using parallelismLimit number of threads.

This will cap coroutines to workers. I'd recommend watching https://www.youtube.com/watch?v=3WGM-_MnPQA
package com.example.workers
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlin.system.measureTimeMillis
class ChannellibgradleApplication
fun main(args: Array<String>) {
var myList = mutableListOf<Int>(3000,1200,1400,3000,1200,1400,3000)
runBlocking {
var myChannel = produce(CoroutineName("MyInts")) {
myList.forEach { send(it) }
}
println("Starting coroutineScope ")
var time = measureTimeMillis {
coroutineScope {
var workers = 2
repeat(workers)
{
launch(CoroutineName("Sleep 1")) { theHardWork(myChannel) }
}
}
}
println("Ending coroutineScope $time ms")
}
}
suspend fun theHardWork(channel : ReceiveChannel<Int>)
{
for(m in channel) {
println("Starting Sleep $m")
delay(m.toLong())
println("Ending Sleep $m")
}
}

Scala - multithreading, finish main thread when any child thread finishes

I am building a method that takes x-sized sequence of methods and returns the result of the first method to finish.
def invokeAny(work: Seq[() => Int]): Int = ???
How can I accomplish this by using Threads? (no futures allowed)
This is the best I have been able to come up with, but seems not to work in all circumstances.
def invokeAny(work: Seq[() => Int]): Int = {
@volatile var result = 0 // set to return value of any work function
val main = Thread.currentThread()
val threads: Seq[Thread] = work.map(work => new Thread( new Runnable {
def run { result = work(); main.interrupt(); }}))
threads.foreach(_.start())
for(thread <- threads) {
try {
thread.join()
} catch {
// We've been interrupted: finish
case e: InterruptedException => return result
}
}
return result
}

Not the prettiest answer, but it seemed to work:
def invokeAny(work: Seq[() => Int]): Int = {
@volatile var result = 0 // set to return value of any work function
val main = Thread.currentThread()
var threads: Seq[Thread] = Seq()
//Interrupts all threads after one is interrupted
def interruptAll = {
main.interrupt()
for(thread <- threads) {
thread.interrupt()
}
}
threads = work.map(work => new Thread(
new Runnable {
def run {
result = try {
work() } catch {
case e:InterruptedException => return
}
interruptAll;
}
}))
threads.foreach(_.start())
for(thread <- threads) {
try {
thread.join()
} catch {
// We've been interrupted: finish
case e: InterruptedException => return result
}
}
return result
}

Using a BlockingQueue, with no shared mutable state: worker threads write to a queue, and the main thread waits until they finish, then reads from the queue and does something with the results, such as summing them.
def invokeAny1(work: Seq[() => Int]): Int = {
val queue = new ArrayBlockingQueue[Int](work.size)
val threads: Seq[Thread] = work.map(w => new Thread( new Runnable {
def run {
val result= w()
queue.put(result) }}))
threads.foreach(_.start())
threads.foreach(_.join())
var sum:Int=0
while(!queue.isEmpty) {
sum +=queue.take()
}
sum
}
Using a CountDownLatch.
Worker threads increment an atomic variable.
When all the threads are done the latch is released and the main thread can read the data from the atomic variable
def invokeAny2(work: Seq[() => Int]): Int = {
val total=new AtomicInteger
val latch= new CountDownLatch(work.size)
val threads: Seq[Thread] = work.map(w => new Thread( new Runnable {
def run {
val result= w()
total.getAndAdd(result)
latch.countDown
}}))
threads.foreach(_.start())
latch.await //wait till the latch is released
total.get
}
}
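Both variants above wait for every worker before returning, so they combine all results rather than returning the first one. A minimal sketch of returning the first result with a blocking queue follows, assuming the work functions can be cancelled by interruption. `InvokeAnySketch` and `invokeAnyFirst` are hypothetical names, and if every work function throws, `take()` would block forever.

```scala
import java.util.concurrent.LinkedBlockingQueue

object InvokeAnySketch {
  // Returns the result of whichever work function finishes first.
  def invokeAnyFirst(work: Seq[() => Int]): Int = {
    require(work.nonEmpty, "work must not be empty")
    val queue = new LinkedBlockingQueue[Int]()
    val threads = work.map { w =>
      val t = new Thread(new Runnable {
        override def run(): Unit =
          try queue.put(w())
          catch { case _: InterruptedException => () } // cancelled: exit quietly
      })
      t.setDaemon(true) // a worker that ignores interruption cannot keep the JVM alive
      t
    }
    threads.foreach(_.start())
    val first = queue.take()       // blocks until the first result arrives
    threads.foreach(_.interrupt()) // ask the slower workers to stop
    first
  }

  def main(args: Array[String]): Unit = {
    val slow = () => { Thread.sleep(2000); 1 }
    val fast = () => { Thread.sleep(50); 2 }
    println(invokeAnyFirst(Seq(slow, fast)))
  }
}
```

The main thread is never interrupted here, which avoids the race in the interrupt-based versions where the interrupt can arrive outside of `join()`.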

How to read InputStream only once using CustomReceiver

I have written a custom receiver to receive the stream generated by one of our applications. The receiver starts the process, gets the stream, and then calls store. However, the receive method gets called multiple times. I tried to write a proper loop break condition but could not get it to work. How can I ensure it reads the stream only once and does not re-read already processed data?
Here is my custom receiver code:
class MyReceiver() extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
def onStart() {
new Thread("Splunk Receiver") {
override def run() { receive() }
}.start()
}
def onStop() {
}
private def receive() {
try {
/* My Code to run a process and get the stream */
val reader = new ResultsReader(job.getResults()); // ResultsReader is the reader for the application
var event:String = reader.getNextLine;
while (!isStopped || event != null) {
store(event);
event = reader.getNextLine;
}
reader.close()
} catch {
case t: Throwable =>
restart("Error receiving data", t)
}
}
}
Where did I go wrong?
Problems
1) The job and the stream reading run again every 2 seconds, and the same data piles up. So for 60 lines of data I sometimes get 1800 or more in total.
Streaming Code:
val conf = new SparkConf
conf.setAppName("str1");
conf.setMaster("local[2]")
conf.set("spark.driver.allowMultipleContexts", "true");
val ssc = new StreamingContext(conf, Minutes(2));
val customReceiverStream = ssc.receiverStream(new MyReceiver)
println(" searching ");
//if(customReceiverStream.count() > 0 ){
customReceiverStream.foreachRDD(x => {println("=====>"+ x.count());x.count()});
//}
ssc.start();
ssc.awaitTermination()
Note: I am trying this in my local cluster, and with master as local[2].
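One thing worth checking in the receiver above is the loop condition: `!isStopped || event != null` stays true after the reader returns null for as long as the receiver is running, and if an exception is thrown at that point the catch block calls restart, which would run the job again. A minimal sketch of the loop with `&&` follows, using a hypothetical in-memory reader in place of Spark and ResultsReader. `ReadOnceSketch`, `FakeReader`, and `drain` are assumptions for illustration only.

```scala
import scala.collection.mutable.ArrayBuffer

object ReadOnceSketch {
  // Hypothetical stand-in for ResultsReader: returns null once the input is exhausted.
  final class FakeReader(lines: Seq[String]) {
    private val it = lines.iterator
    def getNextLine: String = if (it.hasNext) it.next() else null
  }

  // Reads until the reader is exhausted or the stop flag is set, whichever comes first.
  def drain(reader: FakeReader, isStopped: () => Boolean): Seq[String] = {
    val stored = ArrayBuffer.empty[String]
    var event = reader.getNextLine
    while (!isStopped() && event != null) {
      stored += event // stands in for store(event)
      event = reader.getNextLine
    }
    stored.toSeq
  }

  def main(args: Array[String]): Unit = {
    val out = drain(new FakeReader(Seq("a", "b", "c")), () => false)
    println(out.mkString(","))
  }
}
```

With `&&`, the loop ends as soon as either the input runs out or the receiver is stopped, so each line is handed to the store step once.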

Nicifying execution context's thread pool's output for logging/debugging in Scala

Is there a nice way to rename a pool in/for an execution context to produce nicer output in logs or while debugging? Instead of ForkJoinPool-2-worker-7 (the 2 says nothing about the pool's purpose in the app), I would like something like WorkForkJoinPool-2-worker-7, without creating a new WorkForkJoinPool class for it.
Example:
object LogSample extends App {
val ex1 = ExecutionContext.global
val ex2 = ExecutionContext.fromExecutor(null:Executor) // another global ex context
val system = ActorSystem("system")
val log = Logging(system.eventStream, "my.nice.string")
Future {
log.info("1")
}(ex1)
Future {
log.info("2")
}(ex2)
Thread.sleep(1000)
// output, like this:
/*
[INFO] [09/14/2015 21:53:34.897] [ForkJoinPool-2-worker-7] [my.nice.string] 2
[INFO] [09/14/2015 21:53:34.897] [ForkJoinPool-1-worker-7] [my.nice.string] 1
*/
}

You need to implement a custom thread factory, something like this:
class CustomThreadFactory(prefix: String) extends ForkJoinPool.ForkJoinWorkerThreadFactory {
def newThread(fjp: ForkJoinPool): ForkJoinWorkerThread = {
val thread = new ForkJoinWorkerThread(fjp) {}
thread.setName(prefix + "-" + thread.getName)
thread
}
}
val threadFactory = new CustomThreadFactory("custom prefix here")
val uncaughtExceptionHandler = new UncaughtExceptionHandler {
override def uncaughtException(t: Thread, e: Throwable) = e.printStackTrace()
}
val executor = new ForkJoinPool(10, threadFactory, uncaughtExceptionHandler, true)
val ex2 = ExecutionContext.fromExecutor(executor) // another global ex context
val system = ActorSystem("system")
val log = Logging(system.eventStream, "my.nice.string")
Future {
log.info("2") //[INFO] [09/15/2015 18:22:43.728] [custom prefix here-ForkJoinPool-1-worker-29] [my.nice.string] 2
}(ex2)
Thread.sleep(1000)
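If a ForkJoinPool is not required, the same naming idea works with a plain `java.util.concurrent.ThreadFactory`. This is a minimal sketch, assuming a fixed-size pool is acceptable for the workload. `NamedPoolSketch`, `namedFactory`, `firstThreadName`, and the `work-pool` prefix are hypothetical names for illustration.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object NamedPoolSketch {
  // Every thread created by this factory is named "<prefix>-<n>".
  def namedFactory(prefix: String): ThreadFactory = new ThreadFactory {
    private val counter = new AtomicInteger(0)
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
      t.setDaemon(true)
      t
    }
  }

  // Runs one Future on a pool built from namedFactory and returns the worker thread's name.
  def firstThreadName(prefix: String): String = {
    val pool = Executors.newFixedThreadPool(2, namedFactory(prefix))
    val ec = ExecutionContext.fromExecutorService(pool)
    try Await.result(Future(Thread.currentThread().getName)(ec), 5.seconds)
    finally ec.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(firstThreadName("work-pool"))
}
```

The prefix then appears wherever the logging backend prints the thread name, so each pool's purpose is visible in the log line.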

OK. It seems this is not possible (particularly for the default global implementation) due to the current Scala ExecutionContext implementation.
What I could do is just copy that implementation and replace:
class DefaultThreadFactory(daemonic: Boolean) ... {
def wire[T <: Thread](thread: T): T = {
thread.setName("My" + thread.getId) // ! add this one (make 'My' to be variable)
thread.setDaemon(daemonic)
thread.setUncaughtExceptionHandler(uncaughtExceptionHandler)
thread
}...
because the threadFactory there
val threadFactory = new DefaultThreadFactory(daemonic = true)
is hardcoded ...
(it seems Vladimir Petrosyan was first to show a nicer way :) )
