How is this word count bolt thread-safe?

I am new to Storm and was going through its word count example. Here is the bolt which keeps track of the counts:
public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
            count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
From my understanding, Storm launches multiple threads for a bolt which run in parallel, so how is using a HashMap here thread-safe?

[Bolts and spouts] do not need to be thread-safe. Each task has its own instance of the bolt or spout object it is executing, and for any given task Storm calls all spout/bolt methods on a single thread.
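For illustration, a minimal sketch of how this topology is usually wired, based on the storm-starter word count example (the "spout"/"split" components and the parallelism hints are illustrative; package names differ between backtype.storm and org.apache.storm depending on the Storm version):

// Each of the 12 "count" tasks gets its own WordCount instance (and therefore
// its own HashMap), and fieldsGrouping routes every occurrence of a given word
// to the same task, so per-word counts stay consistent without any locking.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));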

Related

How can I consume Kafka in parallel with Spark Streaming? I set concurrentJobs but get an error [duplicate]

The Kafka documentation describes the following approach:
One Consumer Per Thread: A simple option is to give each thread its own consumer instance.
My code:
public class KafkaConsumerRunner implements Runnable {
private final AtomicBoolean closed = new AtomicBoolean(false);
private final CloudKafkaConsumer consumer;
private final String topicName;
public KafkaConsumerRunner(CloudKafkaConsumer consumer, String topicName) {
this.consumer = consumer;
this.topicName = topicName;
}
@Override
public void run() {
try {
this.consumer.subscribe(topicName);
ConsumerRecords<String, String> records;
while (!closed.get()) {
synchronized (consumer) {
records = consumer.poll(100);
}
for (ConsumerRecord<String, String> tmp : records) {
System.out.println(tmp.value());
}
}
} catch (WakeupException e) {
// Ignore exception if closing
System.out.println(e);
//if (!closed.get()) throw e;
}
}
// Shutdown hook which can be called from a separate thread
public void shutdown() {
closed.set(true);
consumer.wakeup();
}
public static void main(String[] args) {
CloudKafkaConsumer kafkaConsumer = KafkaConsumerBuilder.builder()
.withBootstrapServers("172.31.1.159:9092")
.withGroupId("test")
.build();
ExecutorService executorService = Executors.newFixedThreadPool(5);
executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log"));
executorService.execute(new KafkaConsumerRunner(kafkaConsumer, "log.info"));
executorService.shutdown();
}
}
but it doesn't work and throws an exception:
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
Furthermore, I read the source of Flink (an open-source platform for distributed stream and batch data processing). Flink's multi-threaded consumer code is similar to mine:
long pollTimeout = Long.parseLong(flinkKafkaConsumer.properties.getProperty(KEY_POLL_TIMEOUT, Long.toString(DEFAULT_POLL_TIMEOUT)));
pollLoop: while (running) {
ConsumerRecords<byte[], byte[]> records;
//noinspection SynchronizeOnNonFinalField
synchronized (flinkKafkaConsumer.consumer) {
try {
records = flinkKafkaConsumer.consumer.poll(pollTimeout);
} catch (WakeupException we) {
if (running) {
throw we;
}
// leave loop
continue;
}
}
Flink code of the multi-threaded consumer
What's wrong?
The Kafka consumer is not thread-safe. As you pointed out in your question, the documentation states:
A simple option is to give each thread its own consumer instance
But in your code, you have the same consumer instance wrapped by different KafkaConsumerRunner instances, so multiple threads are accessing the same consumer instance. The Kafka documentation clearly states:
The Kafka consumer is NOT thread-safe. All network I/O happens in the
thread of the application making the call. It is the responsibility of
the user to ensure that multi-threaded access is properly
synchronized. Un-synchronized access will result in
ConcurrentModificationException.
That's exactly the exception you received.
It is throwing the exception on your call to subscribe: this.consumer.subscribe(topicName).
Move that call into a synchronized block like this:
@Override
public void run() {
    try {
        synchronized (consumer) {
            this.consumer.subscribe(topicName);
        }
        ConsumerRecords<String, String> records;
        while (!closed.get()) {
            synchronized (consumer) {
                records = consumer.poll(100);
            }
            for (ConsumerRecord<String, String> tmp : records) {
                System.out.println(tmp.value());
            }
        }
    } catch (WakeupException e) {
        // Ignore exception if closing
        System.out.println(e);
        //if (!closed.get()) throw e;
    }
}
Maybe this is not your case, but if you are merging the processing of data from several topics, then you can read from multiple topics with the same consumer. If not, it is preferable to create separate jobs consuming each topic.
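For completeness, a minimal sketch of the "one consumer per thread" option, reusing the asker's CloudKafkaConsumer/KafkaConsumerBuilder wrapper (these are the question's own classes, not part of the Kafka client API). Each runnable gets its own consumer instance, so the synchronized blocks around subscribe/poll are no longer needed:

public static void main(String[] args) {
    ExecutorService executorService = Executors.newFixedThreadPool(5);
    for (String topic : new String[] {"log", "log.info"}) {
        // Build a separate consumer instance per thread instead of sharing one.
        CloudKafkaConsumer consumer = KafkaConsumerBuilder.builder()
                .withBootstrapServers("172.31.1.159:9092")
                .withGroupId("test")
                .build();
        executorService.execute(new KafkaConsumerRunner(consumer, topic));
    }
    executorService.shutdown();
}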

EntryProcessor without locking entries

In my application, I'm trying to process data in an IMap; the scenario is as follows:
the application receives a request (REST, for example) with a set of keys to be processed
the application processes the entries with the given keys and returns the result - a map where the key is the original key of the entry and the value is the calculated result
For this scenario IMap.executeOnKeys is almost perfect, with one problem - the entry is locked while being processed, and that really hurts throughput. The IMap is populated on startup and never modified.
Is it possible to process entries without locking them? Ideally without sending entries to another node and without causing network overhead (e.g. sending 1000 tasks to a single node in a for-loop).
Here is a reference implementation to demonstrate what I'm trying to achieve:
public class Main {
public static void main(String[] args) throws Exception {
HazelcastInstance instance = Hazelcast.newHazelcastInstance();
IMap<String, String> map = instance.getMap("the-map");
// populated once on startup, never modified
for (int i = 1; i <= 10; i++) {
map.put("key-" + i, "value-" + i);
}
Set<String> keys = new HashSet<>();
keys.add("key-1"); // every requst may have different key set, they may overlap
System.out.println(" ---- processing ----");
ForkJoinPool pool = new ForkJoinPool();
// to simulate parallel requests on the same entry
pool.execute(() -> map.executeOnKeys(keys, new MyEntryProcessor("first")));
pool.execute(() -> map.executeOnKeys(keys, new MyEntryProcessor("second")));
System.out.println(" ---- pool is waiting ----");
pool.shutdown();
pool.awaitTermination(5, TimeUnit.MINUTES);
System.out.println(" ------ DONE -------");
}
static class MyEntryProcessor implements EntryProcessor<String, String> {
private String name;
MyEntryProcessor(String name) {
this.name = name;
}
@Override
public Object process(Map.Entry<String, String> entry) {
System.out.println(name + " is processing " + entry);
return calculate(entry); // may take some time, doesn't modify entry
}
@Override
public EntryBackupProcessor<String, String> getBackupProcessor() {
return null;
}
}
}
Thanks in advance
In executeOnKeys the entries are not locked. Maybe you mean that the processing happens on partition threads, so that there may be no other processing for that particular key? Anyhow, here's the solution:
Your EntryProcessor should implement:
Offloadable interface -> this means that the partition-thread will be used only for reading the value. The calculation will be done in the offloading thread-pool.
ReadOnly interface -> in this case the EP won't hop on the partition-thread again to save the modification you might have done in the entry. Since your EP does not modify entries, this will increase the performance.
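A minimal sketch of the processor with both interfaces added (assuming a Hazelcast 3.8+ API, where Offloadable and ReadOnly live in com.hazelcast.core; calculate is the asker's own helper):

// requires: com.hazelcast.core.Offloadable, com.hazelcast.core.ReadOnly
static class MyEntryProcessor
        implements EntryProcessor<String, String>, Offloadable, ReadOnly {
    private final String name;

    MyEntryProcessor(String name) {
        this.name = name;
    }

    @Override
    public Object process(Map.Entry<String, String> entry) {
        System.out.println(name + " is processing " + entry);
        return calculate(entry); // read-only, possibly slow work
    }

    @Override
    public String getExecutorName() {
        // Run the calculation on the offloading executor, not the partition thread.
        return Offloadable.OFFLOADABLE_EXECUTOR;
    }

    @Override
    public EntryBackupProcessor<String, String> getBackupProcessor() {
        return null; // nothing is written back, so no backup processing is needed
    }
}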

While modifying ArrayList with one thread and iterating it with another thread, it is throwing ConcurrentModificationException

I was trying the code below.
public class IteratorFailFastTest {
private List<Integer> list = new ArrayList<>();
public IteratorFailFastTest() {
for (int i = 0; i < 10; i++) {
list.add(i);
}
}
public void runUpdateThread() {
Thread thread2 = new Thread(new Runnable() {
public void run() {
for (int i = 10; i < 20; i++) {
list.add(i);
}
}
});
thread2.start();
}
public void runIteratorThread() {
Thread thread1 = new Thread(new Runnable() {
public void run() {
ListIterator<Integer> iterator = list.listIterator();
while (iterator.hasNext()) {
Integer number = iterator.next();
System.out.println(number);
}
}
});
thread1.start();
}
public static void main(String[] args) {
// TODO Auto-generated method stub
IteratorFailFastTest tester = new IteratorFailFastTest();
tester.runIteratorThread();
tester.runUpdateThread();
}
}
It throws ConcurrentModificationException sometimes and at other times runs successfully.
What I don't understand is this: there are two different methods, each starting one thread, so I expected them to execute one after the other - when one thread finishes modifying the list, the second thread starts iterating.
I also referred to this link (Why no ConcurrentModificationException when one thread iterating (using Iterator) and other thread modifying same copy of non-thread-safe ArrayList), but it is a different scenario.
So why is it throwing this exception? Is it because of the threads?
Can somebody explain?
You are starting two threads and then doing no further synchronization.
Sometimes, both threads will be running at the same time, and you will get the CME. Other times, the first thread will finish before the second thread actually starts executing. In that scenario you won't get a CME.
The reason you get the variation could well be down to things like load on your system. Or it could simply be down to the fact that the thread scheduler is non-deterministic.
Your threads actually do a tiny amount of work, compared to the overheads of creating / starting a thread. So it is not surprising that one of them can return from its run() method very quickly.
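If the intent really is "modify first, then iterate", one sketch of enforcing that ordering is to return the threads and join them (note the method signatures are changed from the original; alternatively, a CopyOnWriteArrayList tolerates concurrent iteration at the cost of copying on every write):

public Thread runUpdateThread() {
    Thread thread2 = new Thread(new Runnable() {
        public void run() {
            for (int i = 10; i < 20; i++) {
                list.add(i);
            }
        }
    });
    thread2.start();
    return thread2;
}

public static void main(String[] args) throws InterruptedException {
    IteratorFailFastTest tester = new IteratorFailFastTest();
    // join() blocks until the update thread has finished, so the iterator
    // thread never runs concurrently with the modification.
    tester.runUpdateThread().join();
    tester.runIteratorThread();
}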

GroovyShell in Java8 : memory leak / duplicated classes [src code + load test provided]

We have a memory leak caused by GroovyShell/Groovy scripts (see the GroovyEvaluator code at the end). The main problems are (copy-pasted from the MAT analyser):
The class "java.beans.ThreadGroupContext", loaded by "<system class
loader>", occupies 807,406,960 (33.38%) bytes.
and:
16 instances of
"org.codehaus.groovy.reflection.ClassInfo$ClassInfoSet$Segment",
loaded by "sun.misc.Launcher$AppClassLoader # 0x7004e9c80" occupy
1,510,256,544 (62.44%) bytes
We're using Groovy 2.3.11 and Java8 (1.8.0_25 to be exact).
Upgrading to Groovy 2.4.6 doesn't solve the problem; it just improves memory usage a little bit, especially non-heap.
Java args we're using: -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC
BTW, I've read https://dzone.com/articles/groovyshell-and-memory-leaks. We do set the GroovyShell shell to null when it's no longer needed. Using GroovyShell().parse() would probably help, but it isn't really an option for us - we have >10 sets, each consisting of 20-100 scripts, and they can be changed at any time (at runtime).
Setting MaxMetaspaceSize should also help, but it doesn't really solve the root problem or remove the root cause. So I'm still trying to nail it down.
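For reference, a capped-Metaspace variant of the JVM args we were using (the 512m value is only an illustrative cap, not a recommendation):

-XX:MaxMetaspaceSize=512m -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC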
I created a load test to recreate the problem (see the code at the end). When I run it:
heap size, metaspace size and number of classes keep increasing
heap dump taken after several minutes is bigger than 4GB
Performance charts for the first 3 minutes: [screenshots omitted]
As I've already mentioned, I'm using MAT to analyse heap dumps. So let's check the Dominator tree report:
A HashMap takes >30% of the heap.
So let's analyse it further and see what sits inside it. Let's check the hash entries:
It reports 38,830 entries, including 38,780 entries with keys matching ".class Script."
Another thing, the "duplicate classes" report:
We have 400 entries (because the load test defines 400 Groovy scripts), all for "ScriptN" classes.
All of them hold references to GroovyClassLoader$InnerLoader.
I've found a similar bug report: https://issues.apache.org/jira/browse/GROOVY-7498 (see the comments at the end and the attached screenshot) - their problems were solved by upgrading Java to 1.8u51. It didn't do the trick for us, though.
Our code:
public class GroovyEvaluator
{
private GroovyShell shell;
public GroovyEvaluator()
{
this(Collections.<String, Object>emptyMap());
}
public GroovyEvaluator(final Map<String, Object> contextVariables)
{
shell = new GroovyShell();
for (Map.Entry<String, Object> contextVariable : contextVariables.entrySet())
{
shell.setVariable(contextVariable.getKey(), contextVariable.getValue());
}
}
public void setVariables(final Map<String, Object> answers)
{
for (Map.Entry<String, Object> questionAndAnswer : answers.entrySet())
{
String questionId = questionAndAnswer.getKey();
Object answer = questionAndAnswer.getValue();
shell.setVariable(questionId, answer);
}
}
public Object evaluateExpression(String expression)
{
return shell.evaluate(expression);
}
public void setVariable(final String name, final Object value)
{
shell.setVariable(name, value);
}
public void close()
{
shell = null;
}
}
Load test:
/** Run using -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC */
public class GroovyEvaluatorLoadTest
{
private static int NUMBER_OF_QUESTIONS = 400;
private final Map<String, Object> contextVariables = Collections.emptyMap();
private List<Fact> factMappings = new ArrayList<>();
public GroovyEvaluatorLoadTest()
{
for (int i=0; i<NUMBER_OF_QUESTIONS; i++)
{
factMappings.add(new Fact("fact" + i, "question" + i));
}
}
private void callEvaluateExpression(int iter)
{
GroovyEvaluator groovyEvaluator = new GroovyEvaluator(contextVariables);
Map<String, Object> factValues = new HashMap<>();
Map<String, Object> answers = new HashMap<>();
for (int i=0; i<NUMBER_OF_QUESTIONS; i++)
{
factValues.put("fact" + i, iter + "-fact-value-" + i);
answers.put("question" + i, iter + "-answer-" + i);
}
groovyEvaluator.setVariables(answers);
groovyEvaluator.setVariable("answers", answers);
groovyEvaluator.setVariable("facts", factValues);
for (Fact fact : factMappings)
{
groovyEvaluator.evaluateExpression(fact.mapping);
}
groovyEvaluator.close();
}
public static void main(String [] args)
{
GroovyEvaluatorLoadTest test = new GroovyEvaluatorLoadTest();
for (int i=0; i<995000; i++)
{
test.callEvaluateExpression(i);
}
test.callEvaluateExpression(0);
}
}
public class Fact
{
public final String factId;
public final String mapping;
public Fact(final String factId, final String mapping)
{
this.factId = factId;
this.mapping = mapping;
}
}
Any thoughts?
Thanks in advance
OK, this is my solution:
public class GroovyEvaluator
{
private static GroovyScriptCachingBuilder groovyScriptCachingBuilder = new GroovyScriptCachingBuilder();
private Map<String, Object> variables = new HashMap<>();
public GroovyEvaluator()
{
this(Collections.<String, Object>emptyMap());
}
public GroovyEvaluator(final Map<String, Object> contextVariables)
{
variables.putAll(contextVariables);
}
public void setVariables(final Map<String, Object> answers)
{
variables.putAll(answers);
}
public void setVariable(final String name, final Object value)
{
variables.put(name, value);
}
public Object evaluateExpression(String expression)
{
final Binding binding = new Binding();
for (Map.Entry<String, Object> varEntry : variables.entrySet())
{
binding.setProperty(varEntry.getKey(), varEntry.getValue());
}
Script script = groovyScriptCachingBuilder.getScript(expression);
synchronized (script)
{
script.setBinding(binding);
return script.run();
}
}
}
public class GroovyScriptCachingBuilder
{
private GroovyShell shell = new GroovyShell();
private Map<String, Script> scripts = new HashMap<>();
public Script getScript(final String expression)
{
Script script;
if (scripts.containsKey(expression))
{
script = scripts.get(expression);
}
else
{
script = shell.parse(expression);
scripts.put(expression, script);
}
return script;
}
}
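If getScript can be called from several threads at once, the plain HashMap above is itself not thread-safe; one sketch of a safer variant (same behaviour otherwise, assuming Java 8+) uses ConcurrentHashMap.computeIfAbsent so each expression is parsed at most once:

public class GroovyScriptCachingBuilder
{
    private final GroovyShell shell = new GroovyShell();
    private final Map<String, Script> scripts = new ConcurrentHashMap<>();

    public Script getScript(final String expression)
    {
        // Parses the expression on first use and reuses the cached Script afterwards,
        // even when multiple threads ask for the same expression concurrently.
        return scripts.computeIfAbsent(expression, shell::parse);
    }
}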
The new solution keeps the number of loaded classes and the Metaspace size at a constant level. Non-heap allocated memory usage is ~70 MB.
Also: there is no need to use UseConcMarkSweepGC anymore. You can choose whichever GC you want or stick with the default one :)
Synchronising access to script objects might not be the best option, but it is the only one I found that keeps the Metaspace size at a reasonable level - and even better, it keeps it constant. It might not be the best solution for everyone, but it works great for us. We have big sets of tiny scripts, which means this solution is (pretty much) scalable.
Let's see some STATS for GroovyEvaluatorLoadTest with GroovyEvaluator using:
old approach with shell.evaluate(expression):
0 iterations took 5.03 s
100 iterations took 285.185 s
200 iterations took 821.307 s
new approach with script.setBinding(binding):
0 iterations took 4.524 s
100 iterations took 19.291 s
200 iterations took 33.44 s
300 iterations took 47.791 s
400 iterations took 62.086 s
500 iterations took 77.329 s
So an additional advantage is: it's lightning fast compared to the previous, leaking solution ;)

Measure duration of executing combineByKey function in Spark

I want to measure the time that the execution of the combineByKey function needs. With the code below I always get a result of 20-22 ms (HashPartitioner) and ~350 ms (without partitioning), independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB)! Can this be true? Or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true; //or false
int partitionsCount = 100; // between 1 and 200 I can't see any difference in the duration!
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
The web UI provides you with details on the jobs/stages that your application has run. It details the time for each of them, and you can filter various metrics such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the web UI is 8080. Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID] to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means that, unlike an action, it is not applied to your RDD immediately (read more about the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to set up the actual data structures when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
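One sketch of timing the work that combineByKey actually triggers, reusing the asker's variables: wrap an action such as count() in the measurement, since only the action forces the whole lineage (read, shuffle, combine) to run.

long start = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
long keys = consumptionRDD.count(); // action: this is what actually runs the job
long duration = System.currentTimeMillis() - start;
System.out.println("combineByKey + count over " + keys + " keys took " + duration + " ms");

Note that the measured time then includes everything up to the action (reading the file and the shuffle), not just the combine step, which is also why the numbers should finally scale with file size.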
