Storm Bolt parallelism and multithreading

The components under test are:
A Kafka Producer that reads a file from the local machine; the file consists of 1000 lines.
String sCurrentLine;
br = new BufferedReader(new FileReader("D:\\jsonLogTest.txt"));
while ((sCurrentLine = br.readLine()) != null) {
    //System.out.println(sCurrentLine);
    KeyedMessage<String, String> message = new KeyedMessage<String, String>(TOPIC, sCurrentLine);
    producer.send(message);
}
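For context, a minimal sketch of the producer setup this snippet assumes, using the legacy kafka.javaapi.producer API; the broker address, serializer, and topic name are assumptions for illustration, not taken from the original post.
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092");             // assumed broker address
props.put("serializer.class", "kafka.serializer.StringEncoder"); // assumed string serializer
Producer<String, String> producer = new Producer<String, String>(new ProducerConfig(props));
String TOPIC = "storm-test-topic";                                // assumed topic name
// ... read the file and send each line as shown above ...
producer.close();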
A Storm consumer with three bolts. BoltOne is supposed to receive the stream and split it into two different streams (stream1 and stream2); BoltTwo and BoltThree subscribe to these streams.
(In simple words, I want BoltOne to hand the tuples out in parallel, e.g. BoltTwo processes the first 500 lines and BoltThree the last 500 lines.)
Topology
builder.setSpout("line-reader-spout",kafkaSpout,1);
builder.setBolt("bolt-one", new BoltOne(),1).shuffleGrouping("line-reader-spout");
builder.setBolt("bolt-two", new BoltTwo(),1).shuffleGrouping("bolt-one","stream1");
builder.setBolt("bolt-three", new BoltThree(),1).shuffleGrouping("bolt-one","stream2");
BoltOne
collector.emit("stream1", new Values(input.getString(0)));
collector.emit("stream2", new Values(input.getString(0)));
x++;
System.out.println("" + x);
collector.ack(input);
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    outputFieldsDeclarer.declareStream("stream1", new Fields("field1"));
    outputFieldsDeclarer.declareStream("stream2", new Fields("field2"));
}
BoltTwo & BoltThree
public void execute(Tuple input) {
    String sentence = input.getString(0);
    System.out.println("*********B2*************");
}
Console output
*********B2*************
1
*********B3*************
2
*********B2*************
*********B3*************
3
*********B3*************
*********B2*************
4
*********B3*************
*********B2*************
5
*********B2*************
*********B3*************
6
*********B2*************
*********B3*************
7
*********B3*************
*********B2*************
I am totally confused about splitting streams and parallelism; an example would be helpful.
Update: the solution I came up with for now:
public void execute(Tuple input) {
    @SuppressWarnings("unused")
    String sentence = input.getString(0);
    if (x % 2 == 0) {
        collector.emit("stream1", new Values(input.getString(0)));
    } else {
        collector.emit("stream2", new Values(input.getString(0)));
    }
    x++;
    collector.ack(input);
}
I just split the stream on an even/odd basis, and the processing time is roughly halved: while BoltTwo processes one tuple, another is processed by BoltThree.

I guess you run everything using LocalCluster. As there are multiple threads running, output via println(...) is not synchronized, and internal buffering can mess up the order of the output... Thus, what you see is not reliable -- the order is only preserved within a single spout/bolt.
Furthermore, what is the behavior you want to get?
Right now, you have
Spout => Bolt1 =+=> Bolt2
                +=> Bolt3
I.e., the output of Bolt1 is duplicated, and Bolt2 and Bolt3 both receive all output tuples from Bolt1. Thus, Bolt1 counts from 1 to 7, and each output tuple of Bolt1 triggers execute() of both Bolt2 and Bolt3.
As Bolt2 and Bolt3 do the same thing, I guess you want two copies of the same bolt and to partition the input between them. For this, you add only a single bolt and set its parallelism to 2:
builder.setSpout("line-reader-spout",kafkaSpout,1);
builder.setBolt("bolt-one", new BoltOne(),1).shuffleGrouping("line-reader-spout");
builder.setBolt("bolt-two", new BoltTwo(),2).shuffleGrouping("bolt-one","stream1");
Furthermore, Bolt1 only needs to declare a single output stream (not two). If you declare multiple output streams and write to both, you replicate the data...
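To make this concrete, here is a minimal sketch of what BoltOne could look like under that suggestion: it declares and writes to a single stream, and shuffleGrouping together with the parallelism hint of 2 partitions the tuples between the two BoltTwo executors. The package names (org.apache.storm, or backtype.storm on older releases) are assumptions; the stream and field names are taken from the question.
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class BoltOne extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        // Emit every line exactly once on the single declared stream;
        // the downstream shuffle grouping spreads the tuples over the BoltTwo executors.
        collector.emit("stream1", input, new Values(input.getString(0)));
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("stream1", new Fields("field1"));
    }
}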

Related

Getting no response from ProcessBuilder's input stream in some specific case

So I am trying to get the battery status on Linux, and so far the first command (the path variable) works perfectly: I am able to read its response as a Sequence from the input stream. Unfortunately, the second command (the result variable) returns an empty sequence.
fun getLinuxBatteryStatus(): Nothing? {
    val path = """upower --enumerate""".runCommand() ?: return null
    val parameters = listOf("present", "state", "energy-full", "energy", "energy-rate", "time to empty", "percentage")
    val result = """upower -i ${path.first { "battery_BAT" in it }} | grep -E "${parameters.joinToString("|")}""""
        .also { println(it) }
        .runCommand() ?: return null
    result.forEach(::println) // <- no output
    // println(result.count()) // <- 0
    /* Do other thing and return something (that is not related to the problem) */
}
Output:
upower -i /org/freedesktop/UPower/devices/battery_BAT1 | grep -E "present|state|energy-full|energy|energy-rate|time to empty|percentage"
The above output is from the also block of the last command, printed just to preview the command string for debugging. If I run that command directly in the terminal, I get the response successfully:
present: yes
state: charging
energy: 47.903 Wh
energy-empty: 0 Wh
energy-full: 50.299 Wh
energy-full-design: 48.004 Wh
energy-rate: 17.764 W
percentage: 95%
Why is the last command not working (not returning any response) with the ProcessBuilder?
Note: the extension function runCommand has been taken from here
private fun String.runCommand(
    workingDir: File = File("."),
    timeoutAmount: Long = 60,
    timeoutUnit: TimeUnit = TimeUnit.SECONDS
): Sequence<String>? = try {
    ProcessBuilder(split("\\s".toRegex()))
        .directory(workingDir)
        .redirectOutput(ProcessBuilder.Redirect.PIPE)
        .redirectError(ProcessBuilder.Redirect.PIPE)
        .start()
        .apply { waitFor(timeoutAmount, timeoutUnit) }
        .inputStream.bufferedReader().lineSequence()
} catch (e: IOException) {
    e.printStackTrace()
    null
}
The problem here is the pipe.
You're trying to run a pipeline: a construct that involves running multiple programs and needs a shell to interpret it.
But ProcessBuilder runs a single program. In this case, it's running the program upower and passing it the parameters -i, /org/freedesktop/UPower/devices/battery_BAT1, |, grep, -E, and "present|state|energy-full|energy|energy-rate|time to empty|percentage". Obviously upower won't know what to do with the | parameter or those after it.
You could use ProcessBuilder to start a shell instance, which could then run your pipeline; see this answer.
But it would probably be simpler, safer, and more efficient to do the filtering in your own code, and avoid calling grep entirely.
I recommend capturing the process's error output, which would very probably have made the problem clear.
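For reference, a minimal Java sketch of the shell-invocation approach (ProcessBuilder is the same JVM API the Kotlin code above uses); the device path is the one printed in the question's debug output, and running the pipeline through "sh -c" is an assumption about the target system.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.TimeUnit;

public class UpowerPipeline {
    public static void main(String[] args) throws IOException, InterruptedException {
        String command = "upower -i /org/freedesktop/UPower/devices/battery_BAT1"
                + " | grep -E \"present|state|energy-full|energy|energy-rate|time to empty|percentage\"";
        // Hand the whole pipeline to a shell instead of splitting the string on whitespace ourselves.
        Process process = new ProcessBuilder("sh", "-c", command)
                .redirectErrorStream(true)   // also surface stderr, as recommended above
                .start();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        process.waitFor(60, TimeUnit.SECONDS);
    }
}
The Kotlin runCommand helper could be adapted the same way by passing listOf("sh", "-c", this) to ProcessBuilder instead of split("\\s".toRegex()).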

Max Aggregation with Hazelcast-jet

I want to do a simple max across an entire dataset. I started with the Kafka example at: https://github.com/hazelcast/hazelcast-jet-code-samples/blob/0.7-maintenance/kafka/src/main/java/avro/KafkaAvroSource.java
I just changed the pipeline to:
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
.map(Map.Entry::getValue)
.rollingAggregate(minBy(comparingInt(user -> (Integer) user.get(2))))
.map(user -> (Integer) user.get(2))
.drainTo(Sinks.list("result"));
and the result-reading code to:
IListJet<Integer> res = jet.getList("result");
SECONDS.sleep(10);
System.out.println(res.get(0));
SECONDS.sleep(15);
System.out.println(res.get(0));
cancel(job);
to get the largest age of the people in the topic. However, it doesn't return 20 and seems to return different values on different runs. Any idea why?
You seem to be using rollingAggregate, which produces a new output item every time it receives an input item, but you check only the first item it emitted. You must instead find the latest item it emitted. One way to achieve this is to push the result into an IMap sink, using the same key every time:
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
.withoutTimestamps()
.map(Map.Entry::getValue)
.rollingAggregate(minBy(comparingInt(user -> (Integer) user.get(2))))
.map(user -> entry("user", (Integer) user.get(2)))
.drainTo(Sinks.map("result"));
You can fetch the latest result with
IMap<String, Integer> result = jet.getMap("result");
System.out.println(result.get("user"));
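As an aside, the question asks for the largest age while the pipeline uses minBy; a hedged variant of the same pipeline that tracks the maximum is sketched below, assuming maxBy from the same AggregateOperations class as minBy and the comparingInt factory the original sample already imports.
// Hypothetical variant: keep the largest age seen so far instead of the smallest.
p.drawFrom(KafkaSources.<Integer, User>kafka(brokerProperties(), TOPIC))
 .withoutTimestamps()
 .map(Map.Entry::getValue)
 .rollingAggregate(maxBy(comparingInt(user -> (Integer) user.get(2))))
 .map(user -> entry("user", (Integer) user.get(2)))
 .drainTo(Sinks.map("result"));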

Spark stream data from IBM MQ

I want to stream data from IBM MQ. I have tried out this code I found on Github.
I am able to stream data from the queue, but each time it streams it takes all of the data from the queue. I just want to take the new data that is pushed onto the queue. I looked on many sites but didn't find the correct solution.
In Kafka we had something like KafkaStreamUtils for streaming near-real-time data. Is there anything similar to that for IBM MQ, so that it streams only the latest data?
The sample in the link you provided shows that it calls the following method to receive from IBM MQ:
CustomMQReciever(String host , int port, String qm, String channel, String qn)
If you review CustomMQReciever here, you can see that it is only browsing the messages on the queue. This means the messages stay on the queue, and the next time you connect you will receive the same messages:
MQQueueBrowser browser = (MQQueueBrowser) qSession.createBrowser(queue);
If you want the messages to be removed from the queue, you need to call a method that actually consumes them from the queue instead of browsing them. Below is an example of changes to CustomMQReciever.java that should accomplish what you want:
In initConnection(), change the code above to the following so that messages are removed from the queue:
MQMessageConsumer consumer = (MQMessageConsumer) qSession.createConsumer(queue);
Get rid of:
enumeration= browser.getEnumeration();
In receive(), change the following:
while (!isStopped() && enumeration.hasMoreElements())
{
    receivedMessage = (JMSMessage) enumeration.nextElement();
    String userInput = convertStreamToString(receivedMessage);
    //System.out.println("Received data :'" + userInput + "'");
    store(userInput);
}
To something like this:
while (!isStopped() && (receivedMessage = consumer.receiveNoWait()) != null)
{
    String userInput = convertStreamToString(receivedMessage);
    //System.out.println("Received data :'" + userInput + "'");
    store(userInput);
}

Sphinx Voice Activity Detection

So I'm trying to write a simple program that will detect voice activity in a .wav file using the CMU Sphinx library.
So far, I have the following
SpeechClassifier s = new SpeechClassifier();
s.setPredecessor(dataSource);
Data d = s.getData();
while (d != null) {
    if (s.isSpeech()) {
        System.out.println("Speech is detected");
    } else {
        System.out.println("Speech has not been detected");
    }
    System.out.println();
    d = s.getData();
}
I get the output "Speech has not been detected", but there is speech in the audio file. It seems as if the getData function is not working the way I want it to: I want it to fetch the frames and then determine whether each frame contains speech (s.isSpeech()) or not.
I'm trying to get one output per frame ("Speech is detected" vs. "Speech has not been detected"). How can I make my code better? Thanks!
You need to insert DataBlocker before SpeechClassifier:
DataBlocker b = new DataBlocker(10); // means 10ms
SpeechClassifier s = new SpeechClassifier(10, 0.003, 10, 0);
b.setPredecessor(dataSource);
s.setPredecessor(b);
Then it will process 10 millisecond frames.
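Putting the two snippets together, a minimal sketch of the full chain might look like the following. The AudioFileDataSource setup (constructor arguments and file URL) is an assumption, the DataBlocker/SpeechClassifier wiring and the classification loop come from the snippets above, and exact package names can vary between sphinx4 releases.
import java.net.URL;
import edu.cmu.sphinx.frontend.Data;
import edu.cmu.sphinx.frontend.DataBlocker;
import edu.cmu.sphinx.frontend.endpoint.SpeechClassifier;
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;

public class VadDemo {
    public static void main(String[] args) throws Exception {
        // Assumed data source: read the .wav file in 3200-byte chunks.
        AudioFileDataSource dataSource = new AudioFileDataSource(3200, null);
        dataSource.setAudioFile(new URL("file:test.wav"), null);

        DataBlocker b = new DataBlocker(10);                        // 10 ms blocks
        SpeechClassifier s = new SpeechClassifier(10, 0.003, 10, 0);
        b.setPredecessor(dataSource);
        s.setPredecessor(b);

        // Pull data through the chain and report speech/non-speech per frame.
        Data d = s.getData();
        while (d != null) {
            System.out.println(s.isSpeech() ? "Speech is detected" : "Speech has not been detected");
            d = s.getData();
        }
    }
}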

Spark Streaming textFileStream not supporting wildcards

I set up a simple test to stream text files from S3 and got it to work when I tried something like
val input = ssc.textFileStream("s3n://mybucket/2015/04/03/")
Log files would go into that bucket path and everything would work fine.
But if there was a subfolder, it would not find any files that were put into the subfolder (and yes, I am aware that HDFS doesn't actually use a folder structure):
val input = ssc.textFileStream("s3n://mybucket/2015/04/")
So I tried simply using wildcards, as I have done before with a standard Spark application:
val input = ssc.textFileStream("s3n://mybucket/2015/04/*")
But when I try this, it throws an error:
java.io.FileNotFoundException: File s3n://mybucket/2015/04/* does not exist.
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1483)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:176)
at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:134)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
.....
I know for a fact that you can use wildcards when reading file input in a standard Spark application, but it appears that streaming input neither supports wildcards nor automatically processes files in subfolders. Is there something I'm missing here?
Ultimately, what I need is a streaming job running 24/7 that monitors an S3 bucket where logs are placed by date.
So something like
s3n://mybucket/<YEAR>/<MONTH>/<DAY>/<LogfileName>
Is there any way to hand it the topmost folder and have it automatically read files that show up in any subfolder (because obviously the date will increase every day)?
EDIT
Upon digging into the documentation at http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources, it states that nested directories are not supported.
Can anyone shed some light on why this is the case?
Also, since my files will be nested by date, what would be a good way of solving this problem in my streaming application? It's a little complicated because the logs take a few minutes to get written to S3, so the last file written for a day could end up in the previous day's folder even though we're a few minutes into the new day.
Some "ugly but working solution" can be created by extending FileInputDStream.
Writing sc.textFileStream(d) is equivalent to
new FileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
You can create a CustomFileInputDStream that extends FileInputDStream. The custom class copies the compute method from FileInputDStream and adjusts the findNewFiles method to your needs.
Changing the findNewFiles method from:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val newFiles = fs.listStatus(directoryPath, filter).map(_.getPath.toString)
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
to:
private def findNewFiles(currentTime: Long): Array[String] = {
  try {
    lastNewFileFindingTime = clock.getTimeMillis()
    // Calculate ignore threshold
    val modTimeIgnoreThreshold = math.max(
      initialModTimeIgnoreThreshold, // initial threshold based on newFilesOnly setting
      currentTime - durationToRemember.milliseconds // trailing end of the remember window
    )
    logDebug(s"Getting new files for time $currentTime, " +
      s"ignoring files older than $modTimeIgnoreThreshold")
    val filter = new PathFilter {
      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
    }
    val directories = fs.listStatus(directoryPath).filter(_.isDirectory)
    val newFiles = ArrayBuffer[FileStatus]()
    directories.foreach(directory => newFiles.append(fs.listStatus(directory.getPath, filter) : _*))
    val timeTaken = clock.getTimeMillis() - lastNewFileFindingTime
    logInfo("Finding new files took " + timeTaken + " ms")
    logDebug("# cached file times = " + fileToModTime.size)
    if (timeTaken > slideDuration.milliseconds) {
      logWarning(
        "Time taken to find new files exceeds the batch size. " +
        "Consider increasing the batch size or reducing the number of " +
        "files in the monitored directory."
      )
    }
    newFiles.map(_.getPath.toString).toArray
  } catch {
    case e: Exception =>
      logWarning("Error finding new files", e)
      reset()
      Array.empty
  }
}
will check for files in all first-degree subfolders; you can adjust it to use the batch timestamp in order to access only the relevant "subdirectories".
I created the CustomFileInputDStream as I mentioned and activated it by calling:
new CustomFileInputDStream[LongWritable, Text, TextInputFormat](streamingContext, d).map(_._2.toString)
It seems to behave as expected.
When I write a solution like this, I must add some points for consideration:
You are breaking Spark's encapsulation and creating a custom class that you alone will have to maintain as time passes.
I believe a solution like this is a last resort. If your use case can be implemented a different way, it is usually better to avoid it.
If you have a lot of "subdirectories" on S3 and check each one of them, it will cost you.
It would be very interesting to understand whether Databricks doesn't support nested files just because of a possible performance penalty, or whether there is a deeper reason I haven't thought about.
We had the same problem. We joined the subfolder names with commas.
List<String> paths = new ArrayList<>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy/MM/dd");
try {
    Date start = sdf.parse("2015/02/01");
    Date end = sdf.parse("2015/04/01");
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(start);
    while (calendar.getTime().before(end)) {
        paths.add("s3n://mybucket/" + sdf.format(calendar.getTime()));
        calendar.add(Calendar.DATE, 1);
    }
} catch (ParseException e) {
    e.printStackTrace();
}
String joinedPaths = StringUtils.join(",", paths.toArray(new String[paths.size()]));
val input = ssc.textFileStream(joinedPaths);
I hope this solves your problem.
