Stopping Spark Streaming context on encountering a certain message on Kafka - apache-spark

In my Spark Streaming application I am reading data from a certain Kafka topic. Whenever I encounter a specific message (for example: "poison") while reading from the topic, I want to stop the streaming. Currently I am achieving this with the following code:
jsc is an instance of JavaStreamingContext and directStream is an instance of JavaPairInputDStream.
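For context, here is a minimal sketch (my assumption, since the question does not show this part) of how jsc, sc and directStream might be created, using the spark-streaming-kafka-0-8 direct API with placeholder broker, topic and batch-interval values:

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import kafka.serializer.StringDecoder;
    import java.util.*;

    SparkConf conf = new SparkConf().setAppName("poison-pill-demo");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5)); // placeholder batch interval
    SparkContext sc = jsc.sparkContext().sc(); // used for the accumulator below

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "localhost:9092"); // placeholder broker

    Set<String> topics = Collections.singleton("my-topic");    // placeholder topic

    // Direct (receiver-less) stream of (key, value) pairs from the topic.
    JavaPairInputDStream<String, String> directStream = KafkaUtils.createDirectStream(
            jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);

With that in place, the stopping logic from the question is: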
LongAccumulator poisonNotifier = sc.longAccumulator("poisonNotifier");

directStream.foreachRDD(rdd -> {
    RDD<Row> rows = rdd.values().map(value -> {
        if (value.equals("poison")) {
            poisonNotifier.add(1);
        } else {
            ...
        }
        return row; // row is built in the elided branch above
    }).rdd();
});

jsc.start();

// Monitoring thread: polls the accumulator and stops the streaming context
// (stopSparkContext = false, stopGracefully = true) once a poison pill is seen.
ExecutorService poisonMonitor = Executors.newSingleThreadExecutor();
poisonMonitor.execute(() -> {
    while (true) {
        if (poisonNotifier.value() > 0) {
            jsc.stop(false, true);
            break;
        }
    }
});

try {
    jsc.awaitTermination();
} catch (InterruptedException e) {
    e.printStackTrace();
}
poisonMonitor.shutdown();
Although this approach works, it doesn't sound like the right approach to me. Is there a better (cleaner) way to achieve the same thing?

Related

How to reshuffle in Apache Beam using the Spark runner

I am doing this simulation using the Spark runner:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);

p.apply(Create.of(1))
    .apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void apply(@Element Integer element, OutputReceiver<Integer> outputReceiver) {
            IntStream.range(0, 4_000_000).forEach(outputReceiver::output);
        }
    }))
    .apply(Reshuffle.viaRandomKey())
    .apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void apply(@Element Integer element, OutputReceiver<Integer> outputReceiver) {
            try {
                // simulate a rpc call of 10ms
                Thread.sleep(10);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            outputReceiver.output(element);
        }
    }));

PipelineResult result = p.run();
result.waitUntilFinish();
I am running with --runner=SparkRunner --sparkMaster=local[8], but only one thread is used after the reshuffle.
Why is the Reshuffle not working?
If I replace the reshuffle with this:
.apply(MapElements.into(kvs(integers(), integers())).via(e -> KV.of(e % 8, e)))
.apply(GroupByKey.create())
.apply(Values.create())
.apply(Flatten.iterables())
Then I get 8 threads running.
BR, Rafael.
It looks like Reshuffle on Beam on Spark boils down to the implementation at
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191
I wonder if in this case both rdd.context().defaultParallelism() and rdd.getNumPartitions() are 1. I've filed https://issues.apache.org/jira/browse/BEAM-10834 to investigate.
In the meantime, you can use GroupByKey to get the desired parallelism, as you've indicated. (If you don't literally have integers, you could try using the hash of your element, Math.random(), or even an incrementing counter as the key.)
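For illustration, here is a sketch of that GroupByKey workaround using a pseudo-random key rather than e % 8. It has the same shape as the code in the question, not a fix from the Beam project; the bucket count of 8 is an assumption matching local[8], kvs/integers are the TypeDescriptors helpers already statically imported in the question, and ThreadLocalRandom is from java.util.concurrent:

    // Manual "reshuffle": attach a pseudo-random key, group by it to force a shuffle,
    // then drop the keys and flatten the grouped values back into individual elements.
    .apply(MapElements.into(kvs(integers(), integers()))
        .via(e -> KV.of(ThreadLocalRandom.current().nextInt(8), e)))
    .apply(GroupByKey.create())
    .apply(Values.create())
    .apply(Flatten.iterables())

Using a random key avoids skew when the natural keys are unevenly distributed; the answer above also suggests a hash of the element or an incrementing counter.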

Spark Streaming: Exception thrown while writing record: BatchAllocationEvent

I shut down a Spark StreamingContext with the following code.
Essentially a thread monitors a boolean switch and then calls StreamingContext.stop(true, true).
Everything seems to process and all my data appears to have been collected. However, I get the following exception on shutdown.
Can I ignore it? It looks like there is potential for data loss.
18/03/07 11:46:40 WARN ReceivedBlockTracker: Exception thrown while
writing record: BatchAllocationEvent(1520452000000
ms,AllocatedBlocks(Map(0 -> ArrayBuffer()))) to the WriteAheadLog.
java.lang.IllegalStateException: close() was called on
BatchedWriteAheadLog before write request with time 1520452000001
could be fulfilled.
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:86)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
The Thread
var stopScc = false

private def stopSccThread(): Unit = {
  val thread = new Thread {
    override def run {
      var continueRun = true
      while (continueRun) {
        logger.debug("Checking status")
        if (stopScc == true) {
          getSparkStreamingContext(fieldVariables).stop(true, true)
          logger.info("Called Stop on Streaming Context")
          continueRun = false
        }
        Thread.sleep(50)
      }
    }
  }
  thread.start
}
The Stream
@throws(classOf[IKodaMLException])
def startStream(ip: String, port: Int): Unit = {
  try {
    val ssc = getSparkStreamingContext(fieldVariables)
    ssc.checkpoint("./ikoda/cp")

    val lines = ssc.socketTextStream(ip, port, StorageLevel.MEMORY_AND_DISK_SER)
    lines.print

    val lmap = lines.map {
      l =>
        if (l.contains("IKODA_END_STREAM")) {
          stopScc = true
        }
        l
    }

    lmap.foreachRDD {
      r =>
        if (r.count() > 0) {
          logger.info(s"RECEIVED: ${r.toString()} first: ${r.first().toString}")
          r.saveAsTextFile("./ikoda/test/test")
        }
        else {
          logger.info("Empty RDD. No data received")
        }
    }

    ssc.start()
    ssc.awaitTermination()
  }
  catch {
    case e: Exception =>
      logger.error(e.getMessage, e)
      throw new IKodaMLException(e.getMessage, e)
  }
}
I had the same issue, and calling close() instead of stop() fixed it.

How to use GroupState timeout in Spark Structured Streaming to form a time window?

I would like to detect a continuous pattern over X minutes using Spark Structured Streaming. I wonder if I can use GroupState timeout to form a time window.
What I want to do is, once I detect the first occurrence of a pattern in an object (EntityMetric), check whether this pattern continues to occur in all subsequent EntityMetric objects in the stream within X minutes. After X minutes have passed, return an Alert object.
Here is the code I have, but somehow I never see state.hasTimedOut() become true. I wonder if I missed something here? Any help is much appreciated. Thanks!
Dataset<EntityMetric> ems = spark
    .readStream()...

KeyValueGroupedDataset<Integer, EntityMetric> groupedEm =
    ems.groupByKey((MapFunction<EntityMetric, Integer>) m -> m.<Integer>getEntityId(), Encoders.INT());

MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert> continuousViolationsFunc =
    new MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert>() {

        @Override
        public Alert call(Integer entityId, Iterator<EntityMetric> events, GroupState<Alert> state)
                throws Exception {
            Alert currentAlert = null;
            Alert newAlert = null;
            …
            …
            if (state.hasTimedOut()) {
                // How come state.hasTimedOut() is never true?
                state.remove();
            } else if (state.exists()) {
                currentAlert = state.get();
                while (events.hasNext()) {
                    EntityMetric e = events.next();
                    // Pattern matching logic that instantiates and populates newAlert…
                }
                if (newAlert != null) {
                    state.update(newAlert);
                }
            } else {
                boolean startTimer = false;
                // For the first occurrence…
                while (events.hasNext()) {
                    EntityMetric e = events.next();
                    // Pattern matching logic that set startTimer to true…
                }
                if (startTimer) {
                    state.update(newAlert);
                    state.setTimeoutDuration("1 minutes");
                }
            }
            return newAlert;
        }
    };

Dataset<Alert> alerts = groupedEm.mapGroupsWithState(
    continuousViolationsFunc,
    Encoders.bean(Alert.class),
    Encoders.bean(Alert.class),
    GroupStateTimeout.ProcessingTimeTimeout());

StreamingQuery query = alerts
    .writeStream()
    .format("console")
    .outputMode(OutputMode.Update())
    .start();
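One detail in the GroupState API docs is easy to miss and worth checking against the code above: with ProcessingTimeTimeout, any pending timeout is cleared each time the function is invoked for a key with new data, so setTimeoutDuration has to be called again on every invocation that should keep the timer running, not only on the first occurrence (the code above only sets it in the first-occurrence branch). Also, a timeout can only be detected when a micro-batch actually runs, which in practice requires some data arriving in the stream. A minimal sketch of that shape, reusing the names from the question with the pattern-matching logic elided:

    // Sketch only: the key point is that setTimeoutDuration is re-armed on every
    // call that sees data for the key; otherwise the pending timeout is dropped.
    MapGroupsWithStateFunction<Integer, EntityMetric, Alert, Alert> func =
        (entityId, events, state) -> {
            if (state.hasTimedOut()) {
                Alert expired = state.exists() ? state.get() : null; // X minutes elapsed for this entity
                state.remove();
                return expired;
            }
            Alert alert = state.exists() ? state.get() : null;
            while (events.hasNext()) {
                EntityMetric e = events.next();
                // ... pattern-matching logic that creates or updates alert ...
            }
            if (alert != null) {
                state.update(alert);
                state.setTimeoutDuration("1 minutes"); // re-set on every invocation
            }
            return alert; // may be null, as in the question's code
        };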

Hazelcast IMap tryLock bulk multiple keys

I have a clustered program where each thread wants to lock a set of keys.
As I understand it, the simplest solution using Hazelcast is the following:
private SortedSet<String> lock(SortedSet<String> objects) throws Exception {
    IMap<String, String> lockMap = getLockMap();
    for (; ; ) {
        SortedSet<String> lockedObjects = new TreeSet<>();
        for (String object : objects) {
            try {
                boolean locked = lockMap.tryLock(object, 0, null, maxBlockTime, TimeUnit.MILLISECONDS);
                if (locked) {
                    lockedObjects.add(object);
                } else {
                    // Could not lock this key: release everything acquired so far and retry later.
                    for (String lockedObject : lockedObjects) {
                        lockMap.unlock(lockedObject);
                    }
                    lockedObjects.clear();
                    break;
                }
            } catch (Exception e) {
                for (String lockedObject : lockedObjects) {
                    try {
                        lockMap.unlock(lockedObject);
                    } catch (Exception ignored) {
                    }
                }
                throw e;
            }
        }
        if (!lockedObjects.isEmpty()) {
            return lockedObjects;
        }
        Thread.sleep(waitTime);
    }
}
The main problem with this code is that it generates a lot of network traffic and requests to Hazelcast. Could somebody recommend how this code can be optimized?
I couldn't find bulk tryLock functionality in the Hazelcast IMap.
Hazelcast does not offer a bulkLock method.
You can optimize this code in various ways, but before we get to that it would be great to know why you want to lock these entries and what you are trying to achieve.
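Purely for illustration (this is not part of the answer above, and whether it fits depends on the use case the answer asks about): one well-known way to reduce the number of lock calls is lock striping, i.e. mapping each key to one of a fixed number of buckets and locking the buckets in a fixed order. Here getBucketLockMap(), the bucket count and the surrounding error handling are assumptions:

    private static final int LOCK_BUCKETS = 64; // assumption: tune for your workload

    // Map each key to a bucket; two sets that share a key always share that key's bucket.
    private SortedSet<Integer> buckets(SortedSet<String> objects) {
        SortedSet<Integer> result = new TreeSet<>();
        for (String object : objects) {
            result.add(Math.floorMod(object.hashCode(), LOCK_BUCKETS));
        }
        return result;
    }

    // Acquire the buckets in ascending order; consistent ordering avoids deadlock.
    private void lockBuckets(SortedSet<Integer> buckets) {
        IMap<Integer, Integer> lockMap = getBucketLockMap(); // hypothetical helper
        for (Integer bucket : buckets) {
            lockMap.lock(bucket);
        }
    }

    private void unlockBuckets(SortedSet<Integer> buckets) {
        IMap<Integer, Integer> lockMap = getBucketLockMap();
        for (Integer bucket : buckets) {
            lockMap.unlock(bucket);
        }
    }

The trade-off is false contention: unrelated keys that land in the same bucket block each other, so this only helps if coarser lock granularity is acceptable for your workload.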

How to read InputStream only once using CustomReceiver

I have written a custom receiver to receive the stream that is generated by one of our applications. The receiver starts the process, gets the stream, and then calls store(). However, the receive method gets called multiple times. I have written what I thought was a proper loop-break condition, but it did not work. How do I ensure the receiver reads the stream only once and does not re-read already processed data?
Here is my custom receiver code:
class MyReceiver() extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    new Thread("Splunk Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
  }

  private def receive() {
    try {
      /* My Code to run a process and get the stream */
      val reader = new ResultsReader(job.getResults()); // ResultsReader is the reader for the application
      var event: String = reader.getNextLine;
      while (!isStopped || event != null) {
        store(event);
        event = reader.getNextLine;
      }
      reader.close()
    } catch {
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
Where did I go wrong?
Problems
1) The job and the stream reading happen again every 2 seconds, and the same data piles up. So, for 60 lines of data, I sometimes get 1800 or more in total.
Streaming Code:
val conf = new SparkConf
conf.setAppName("str1");
conf.setMaster("local[2]")
conf.set("spark.driver.allowMultipleContexts", "true");
val ssc = new StreamingContext(conf, Minutes(2));
val customReceiverStream = ssc.receiverStream(new MyReceiver)
println(" searching ");
//if(customReceiverStream.count() > 0 ){
customReceiverStream.foreachRDD(x => {println("=====>"+ x.count());x.count()});
//}
ssc.start();
ssc.awaitTermination()
Note: I am trying this in my local cluster, and with master as local[2].
