How to read InputStream only once using CustomReceiver - apache-spark

I have written a custom receiver to receive the stream that is being generated by one of our applications. The receiver starts the process, gets the stream, and then calls store(). However, the receive method gets called multiple times; I have written what I believe is a proper loop-break condition, but it still re-reads. How do I ensure it reads the stream only once and does not re-read already processed data?
Here is my custom receiver code:
class MyReceiver() extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {

  def onStart() {
    new Thread("Splunk Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop() {
  }

  private def receive() {
    try {
      /* My code to run a process and get the stream */
      val reader = new ResultsReader(job.getResults()) // ResultsReader is the reader for the application
      var event: String = reader.getNextLine
      while (!isStopped || event != null) {
        store(event)
        event = reader.getNextLine
      }
      reader.close()
    } catch {
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}
Where did I go wrong?
Problems
1) The job and stream reading happen every 2 seconds, and the same data keeps piling up. So, for 60 lines of data, I sometimes get 1800 or more in total.
Streaming Code:
val conf = new SparkConf
conf.setAppName("str1");
conf.setMaster("local[2]")
conf.set("spark.driver.allowMultipleContexts", "true");
val ssc = new StreamingContext(conf, Minutes(2));
val customReceiverStream = ssc.receiverStream(new MyReceiver)
println(" searching ");
//if(customReceiverStream.count() > 0 ){
customReceiverStream.foreachRDD(x => {println("=====>"+ x.count());x.count()});
//}
ssc.start();
ssc.awaitTermination()
Note: I am trying this on my local cluster, with master set to local[2].
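A possible cause (a sketch under the assumption that ResultsReader.getNextLine returns null once the results are exhausted): the condition !isStopped || event != null stays true even after the reader runs out of lines, so the loop keeps calling store(event) with a null event; any failure there triggers restart(), which runs onStart() again and re-reads the same job results, which would explain the multiplied counts. A corrected loop might look like this:

private def receive() {
  var reader: ResultsReader = null
  try {
    /* run the process and get the stream, as in the original code */
    reader = new ResultsReader(job.getResults())
    var event: String = reader.getNextLine
    // exit as soon as the receiver is stopped OR the reader has no more lines
    while (!isStopped && event != null) {
      store(event)
      event = reader.getNextLine
    }
  } catch {
    case t: Throwable =>
      restart("Error receiving data", t)
  } finally {
    if (reader != null) reader.close()
  }
}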

Related

Spark custom receiver not getting data

I'm using Spark Streaming to ingest my company's internal data source. I followed this tutorial to write a receiver: https://spark.apache.org/docs/latest/streaming-custom-receivers.html. But in the Spark UI Streaming tab I always see 0 msgs coming in. I also don't see any errors in the driver logs, so I'm really confused about what's going wrong. (To connect to the internal data source, I need to create a client; then listen() keeps running to get the new msgs.) I suspect it may be because of the listen mode on the data source?
My Receiver
class MyReceiver(val clientId: String, val token: String, val env: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    new Thread("My Data Source") { override def run() { receive() } }.start()
  }

  def onStop() { }

  private def receive() {
    while (!isStopped()) {
      try {
        val client = new Client(clientId, token, "STAGE")
        client.connect()
        client.listen(Client.Topic, new ClientMsgHandler() {
          override def process(event: ClientMsg): Unit = {
            val msg: String = event.getBody
            store(msg)
          }

          override def onException(event: ClientEvent): Unit = {
          }
        })
      } catch {
        case ce: java.net.ConnectException =>
          System.out.println("Could not connect")
        case t: Throwable =>
          System.out.println("Error receiving data")
      }
    }
  }
}
==================================================================
Create Stream
class MyStream(sc: SparkContext, sqlContext: SQLContext, cpDir: String) {

  def creatingFunc(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(3))
    // Set the active SQLContext so that we can access it statically within the foreachRDD
    SQLContext.setActiveSession(sqlContext)
    ssc.checkpoint(cpDir)

    val ClientId = <Myclientid>
    val Token = <Mytoken>
    val env = "STAGE"
    val stream = ssc.receiverStream(new MyReceiver(ClientId, Token, env))
    stream.foreachRDD { rdd =>
      println("Here" + rdd.take(10).mkString(", "))
    }
    ssc
  }
}
==================================================================
Start Streaming
val checkpoint_dir = <my_checkpoint_dir>
val MyDataSourceStream = new MyStream(sc, sqlContext, checkpoint_dir)
val ssc = StreamingContext.getActiveOrCreate(checkpoint_dir, MyDataSourceStream.creatingFunc _)
ssc.start()
ssc.awaitTermination()
Updates:
Since it's an internal source, I cannot share the Client source code. But I've tested the connection: it works with the code below, and the msg is printed out correctly. You can think of Client as an external lib that has no connection issues.
val ClientId = <myclientid>
val Token = <mytoken>
val client = new EVClient(ClientId, Token, "STAGE")
client.connect()
client.listen(Client.Topic, new ClientMsgHandler() {
  override def onEvent(event: ClientMsg): Unit = {
    val res = event.getBody
    println(res)
  }

  override def onException(event: ClientEvent): Unit = {
  }
})
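One thing worth comparing between the two snippets (a sketch under the assumption that listen() registers an asynchronous handler and returns immediately, and that the handler interface really declares process() rather than the onEvent() used in the working test): the receiver's while (!isStopped()) loop creates a new Client on every iteration, whereas the standalone test connects once. A receive() that connects once and then simply keeps the thread alive might look like this:

private def receive() {
  try {
    // connect once and register a single handler that hands messages to Spark
    val client = new Client(clientId, token, env)
    client.connect()
    client.listen(Client.Topic, new ClientMsgHandler() {
      override def process(event: ClientMsg): Unit = {
        store(event.getBody)
      }
      override def onException(event: ClientEvent): Unit = {
        // log the error here
      }
    })
    // keep the receiver thread alive until Spark stops the receiver
    while (!isStopped()) {
      Thread.sleep(100)
    }
    // disconnect/close the client here if the Client API provides such a method
  } catch {
    case t: Throwable =>
      restart("Error receiving data", t)
  }
}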

Multithreaded Kafka Consumer not processing all the partitions in parallel

I have created a multithreaded Kafka consumer in which one thread is assigned to each of the partitions (I have 100 partitions in total). I followed the https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example link.
Below is the init method of my consumer.
consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
System.out.println("Kafka Consumer initialized.");
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topicName, 100);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topicName);
executor = Executors.newFixedThreadPool(100);
In the above init method, I get the list of Kafka streams (100 in total), each of which should be connected to one of the partitions (which is happening as expected).
Then I submit each of the streams to a different thread using the snippet below.
public Object call() {
    for (final KafkaStream stream : streams) {
        executor.execute(new StreamWiseConsumer(stream));
    }
    return true;
}
Below is the StreamWiseConsumer class.
public class StreamWiseConsumer extends Thread {

    ConsumerIterator<byte[], byte[]> consumerIterator;
    private KafkaStream m_stream;

    public StreamWiseConsumer(ConsumerIterator<byte[], byte[]> consumerIterator) {
        this.consumerIterator = consumerIterator;
    }

    public StreamWiseConsumer(KafkaStream kafkaStream) {
        this.m_stream = kafkaStream;
    }

    @Override
    public void run() {
        ConsumerIterator<byte[], byte[]> consumerIterator = m_stream.iterator();
        while (!Thread.currentThread().isInterrupted() && !interrupted) {
            try {
                if (consumerIterator.hasNext()) {
                    String reqId = UUID.randomUUID().toString();
                    System.out.println(reqId + " : Event received by threadId : " + Thread.currentThread().getId());
                    MessageAndMetadata<byte[], byte[]> messageAndMetaData = consumerIterator.next();
                    byte[] keyBytes = messageAndMetaData.key();
                    String key = null;
                    if (keyBytes != null) {
                        key = new String(keyBytes);
                    }
                    byte[] eventBytes = messageAndMetaData.message();
                    if (eventBytes == null) {
                        System.out.println("Topic: No event fetched for transaction Id:" + key);
                        continue;
                    }
                    String event = new String(eventBytes).trim();
                    // Some processing code
                    System.out.println(reqId + " : Processing completed for threadId = " + Thread.currentThread().getId());
                    consumer.commitOffsets();
                }
            } catch (Exception ex) {
            }
        }
    }
}
Ideally, it should start processing from all 100 partitions in parallel. But it picks some random number of events from one of the threads and processes them, then some other thread starts processing from another partition. It looks like sequential processing, just on a different thread each time. I was expecting processing to happen from all 100 threads. Am I missing something here?
Please find the log links below.
https://drive.google.com/file/d/14b7gqPmwUrzUWewsdhnW8q01T_cQ30ES/view?usp=sharing
https://drive.google.com/file/d/1PO_IEsOJFQuerW0y-M9wRUB-1YJuewhF/view?usp=sharing
I doubt whether this is the right approach for vertically scaling Kafka streams.
Kafka Streams inherently supports multi-threaded consumption.
Increase the number of threads used for processing with the num.stream.threads configuration.
If you want 100 threads to process the 100 partitions, set num.stream.threads to 100.
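For reference, a minimal Kafka Streams sketch showing where num.stream.threads goes (the application id, broker address, and topic name below are placeholders, not taken from the question):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ForeachAction
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-consumer-app")        // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")      // placeholder
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)
// one stream thread per partition; threads are spread across the topic's partitions
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "100")

val builder = new StreamsBuilder()
val lines = builder.stream[String, String]("my-topic")                   // placeholder topic
lines.foreach(new ForeachAction[String, String] {
  override def apply(key: String, value: String): Unit = {
    // processing code goes here
    println(s"$key -> $value")
  }
})

val streams = new KafkaStreams(builder.build(), props)
streams.start()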

Spark Streaming: Exception thrown while writing record: BatchAllocationEvent

I shut down a Spark StreamingContext with the following code.
Essentially, a thread monitors a boolean flag and then calls StreamingContext.stop(true, true).
Everything seems to process and all my data appears to have been collected. However, I get the following exception on shutdown.
Can I ignore it? It looks like there is potential for data loss.
18/03/07 11:46:40 WARN ReceivedBlockTracker: Exception thrown while
writing record: BatchAllocationEvent(1520452000000
ms,AllocatedBlocks(Map(0 -> ArrayBuffer()))) to the WriteAheadLog.
java.lang.IllegalStateException: close() was called on
BatchedWriteAheadLog before write request with time 1520452000001
could be fulfilled.
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:86)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
The Thread
var stopScc = false

private def stopSccThread(): Unit = {
  val thread = new Thread {
    override def run {
      var continueRun = true
      while (continueRun) {
        logger.debug("Checking status")
        if (stopScc == true) {
          getSparkStreamingContext(fieldVariables).stop(true, true)
          logger.info("Called Stop on Streaming Context")
          continueRun = false
        }
        Thread.sleep(50)
      }
    }
  }
  thread.start
}
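As an aside, a pattern some people use instead of a separate monitoring thread (a sketch, assuming stopScc is marked @volatile; this is not claimed to be the cause of the warning above) is to poll from the thread that would otherwise block in awaitTermination, using awaitTerminationOrTimeout:

val ssc = getSparkStreamingContext(fieldVariables)
ssc.start()
var stopped = false
while (!stopped) {
  // returns true if the context terminated within the timeout (ms)
  stopped = ssc.awaitTerminationOrTimeout(1000)
  if (!stopped && stopScc) {
    logger.info("Stop flag set; stopping StreamingContext gracefully")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}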
The Stream
@throws(classOf[IKodaMLException])
def startStream(ip: String, port: Int): Unit = {
  try {
    val ssc = getSparkStreamingContext(fieldVariables)
    ssc.checkpoint("./ikoda/cp")

    val lines = ssc.socketTextStream(ip, port, StorageLevel.MEMORY_AND_DISK_SER)
    lines.print

    val lmap = lines.map {
      l =>
        if (l.contains("IKODA_END_STREAM")) {
          stopScc = true
        }
        l
    }

    lmap.foreachRDD {
      r =>
        if (r.count() > 0) {
          logger.info(s"RECEIVED: ${r.toString()} first: ${r.first().toString}")
          r.saveAsTextFile("./ikoda/test/test")
        }
        else {
          logger.info("Empty RDD. No data received")
        }
    }

    ssc.start()
    ssc.awaitTermination()
  }
  catch {
    case e: Exception =>
      logger.error(e.getMessage, e)
      throw new IKodaMLException(e.getMessage, e)
  }
}
I had the same issue, and calling close() instead of stop() fixed it.

Converting UnixTimestamp to TIMEUUID for Cassandra

I'm learning all about Apache Cassandra 3.x.x and I'm trying to develop some things to play around with. The problem is that I want to store data in a Cassandra table which contains these columns:
id (UUID - Primary Key) | Message (TEXT) | REQ_Timestamp (TIMEUUID) | Now_Timestamp (TIMEUUID)
REQ_Timestamp has the time when the message left the client at frontend level. Now_Timestamp, on the other hand, is the time when the message is finally stored in Cassandra. I need both timestamps because I want to measure the amount of time it takes to handle the request from its origin until the data is safely stored.
Creating the Now_Timestamp is easy, I just use the now() function and it generates the TIMEUUID automatically. The problem arises with REQ_Timestamp. How can I convert that Unix Timestamp to a TIMEUUID so Cassandra can store it? Is this even possible?
The architecture of my backend is this: I get the data in a JSON from the frontend to a web service that process it and stores it in Kafka. Then, a Spark Streaming job takes that Kafka log and puts it in Cassandra.
This is my WebService that puts the data in Kafka.
#Path("/")
public class MemoIn {
#POST
#Path("/in")
#Consumes(MediaType.APPLICATION_JSON)
#Produces(MediaType.TEXT_PLAIN)
public Response goInKafka(InputStream incomingData){
StringBuilder bld = new StringBuilder();
try {
BufferedReader in = new BufferedReader(new InputStreamReader(incomingData));
String line = null;
while ((line = in.readLine()) != null) {
bld.append(line);
}
} catch (Exception e) {
System.out.println("Error Parsing: - ");
}
System.out.println("Data Received: " + bld.toString());
JSONObject obj = new JSONObject(bld.toString());
String line = obj.getString("id_memo") + "|" + obj.getString("id_writer") +
"|" + obj.getString("id_diseased")
+ "|" + obj.getString("memo") + "|" + obj.getLong("req_timestamp");
try {
KafkaLogWriter.addToLog(line);
} catch (Exception e) {
e.printStackTrace();
}
return Response.status(200).entity(line).build();
}
}
Here's my Kafka Writer
package main.java.vcemetery.webservice;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
import org.apache.kafka.clients.producer.Producer;

public class KafkaLogWriter {

    public static void addToLog(String memo) throws Exception {
        // private static Scanner in;
        String topicName = "MemosLog";
        /*
        First, we set the properties of the Kafka log
        */
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // We create the producer
        Producer<String, String> producer = new KafkaProducer<>(props);
        // We send the line to the producer
        producer.send(new ProducerRecord<>(topicName, memo));
        // We close the producer
        producer.close();
    }
}
And finally here's what I have of my Spark Streaming job
public class MemoStream {

    public static void main(String[] args) throws Exception {
        Logger.getLogger("org").setLevel(Level.ERROR);
        Logger.getLogger("akka").setLevel(Level.ERROR);

        // Create the context with a 10 second batch size
        SparkConf sparkConf = new SparkConf().setAppName("KafkaSparkExample").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "group1");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        /* Create a collection with the topics to query; in this case only one topic */
        Collection<String> topics = Arrays.asList("MemosLog");

        final JavaInputDStream<ConsumerRecord<String, String>> kafkaStream =
                KafkaUtils.createDirectStream(
                        ssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        kafkaStream.mapToPair(record -> new Tuple2<>(record.key(), record.value()));

        // Split each bucket of Kafka data into memos, a splittable stream
        JavaDStream<String> stream = kafkaStream.map(record -> (record.value().toString()));
        // Then, we split each stream into lines or memos
        JavaDStream<String> memos = stream.flatMap(x -> Arrays.asList(x.split("\n")).iterator());
        /*
        To split each memo into sections of ids and messages, we have to escape the pipe character with \\
        */
        JavaDStream<String> sections = memos.flatMap(y -> Arrays.asList(y.split("\\|")).iterator());
        sections.print();

        sections.foreachRDD(rdd -> {
            rdd.foreachPartition(partitionOfRecords -> {
                // We establish the connection with Cassandra
                Cluster cluster = null;
                try {
                    cluster = Cluster.builder()
                            .withClusterName("VCemeteryMemos") // cluster name
                            .addContactPoint("127.0.0.1")      // host IP
                            .build();
                } finally {
                    if (cluster != null) cluster.close();
                }
                while (partitionOfRecords.hasNext()) {
                }
            });
        });

        ssc.start();
        ssc.awaitTermination();
    }
}
Thank you in advance.
Cassandra has no function to convert from a UNIX timestamp. You have to do the conversion on the client side.
Ref: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
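For example, with the DataStax Java driver (a sketch; the timestamp value is made up for illustration), the com.datastax.driver.core.utils.UUIDs helper can build the TIMEUUID on the client side before binding it into the INSERT:

import com.datastax.driver.core.utils.UUIDs

// Unix timestamp in milliseconds coming from the frontend (example value only)
val reqTimestampMillis: Long = 1520452000000L

// Deterministic TIMEUUID for that instant; the driver docs note startOf() is not unique,
// which is acceptable here since the column is not part of the primary key
val reqTimeuuid: java.util.UUID = UUIDs.startOf(reqTimestampMillis)

// Client-side equivalent of CQL now() for the Now_Timestamp column
val nowTimeuuid: java.util.UUID = UUIDs.timeBased()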

Number of threads in Akka keep increasing. What could be wrong?

Why does the thread count keep increasing? Look at the bottom right in this image.
The overall flow is like this:
Akka HTTP Server API
-> on http request, sendMessageTo DataProcessingActor
-> sendMessageTo StorageActor
-> sendMessageTo DataBaseActor
-> sendMessageTo IndexActor
This is the definition of Akka HTTP API ( in pseudo-code ):
Main {
path("input/") {
post {
dataProcessingActor forward message
}
}
}
Below are the actor definitions ( in pseudo-code ):
DataProcessingActor {
case message =>
message = parse message
storageActor ! message
}
StorageActor {
case message =>
indexActor ! message
databaseActor ! message
}
DataBaseActor {
case message =>
val c = get monogCollection
c.store(message)
}
IndexActor {
case message =>
elasticSearch.index(message)
}
After I run this setup and send multiple HTTP requests to the "input/" endpoint, I get errors:
for (i <- 0 until 1000000) {
  post("input/", someMessage + i)
}
Error:
[ERROR] [04/22/2016 13:20:54.016] [Main-akka.actor.default-dispatcher-15] [akka.tcp://Main#127.0.0.1:2558/system/IO-TCP/selectors/$a/0] Accept error: could not accept new connection
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at akka.io.TcpListener.acceptAllPending(TcpListener.scala:107)
at akka.io.TcpListener$$anonfun$bound$1.applyOrElse(TcpListener.scala:82)
at akka.actor.Actor$class.aroundReceive(Actor.scala:480)
at akka.io.TcpListener.aroundReceive(TcpListener.scala:32)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
EDIT 1
Here is the application.conf file being used:
akka {
  loglevel = "INFO"
  stdout-loglevel = "INFO"
  logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"

  actor {
    default-dispatcher {
      throughput = 10
    }
  }

  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }

  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      hostname = "127.0.0.1"
      port = 2558
    }
  }
}
I figured out that ElasticSearch was the problem. I am using the Java API for ElasticSearch, and it was leaking sockets because of the way it was being used. Now resolved as described here.
Below is the ElasticSearch client service using the Java API:
trait ESClient { def getClient(): Client }

case class ElasticSearchService() extends ESClient {
  def getClient(): Client = {
    val client = new TransportClient().addTransportAddress(
      new InetSocketTransportAddress(Config.ES_HOST, Config.ES_PORT)
    )
    client
  }
}
This is the actor which was causing the leak:
class IndexerActor() extends Actor {

  val elasticSearchSvc = new ElasticSearchService()
  lazy val client = elasticSearchSvc.getClient()

  override def preStart = {
    // initialize index, mappings, etc.
  }

  def receive() = {
    case message =>
      // do indexing here
      indexMessage(client, message)
  }
}
NOTE: every time an actor instance was created, a new connection was being made.
Every invocation of new ElasticSearchService() was creating a new connection to ElasticSearch. I moved that into a separate object, as shown below, and the actor now uses this object instead:
object ES {
  val elasticSearchSvc = new ElasticSearchService()
  lazy val client = elasticSearchSvc.getClient()
}

class IndexerActor() extends Actor {

  override def preStart = {
    // initialize index, mappings, etc.
  }

  def receive() = {
    case message =>
      // do indexing here
      indexMessage(ES.client, message)
  }
}
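A small follow-up on this design (a sketch using the names above): since ES.client is now a single shared TransportClient, it should be closed exactly once at application shutdown, for example when the ActorSystem terminates, rather than inside any actor:

object ES {
  val elasticSearchSvc = new ElasticSearchService()
  lazy val client = elasticSearchSvc.getClient()

  def shutdown(): Unit = client.close()   // TransportClient exposes close()
}

// during application wiring, e.g. where the ActorSystem is created:
// system.registerOnTermination(ES.shutdown())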

Resources