How to reshuffle in apache beam using spark runner

I am doing this simulation using the spark runner:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
p.apply(Create.of(1))
    .apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void apply(@Element Integer element, OutputReceiver<Integer> outputReceiver) {
            IntStream.range(0, 4_000_000).forEach(outputReceiver::output);
        }
    }))
    .apply(Reshuffle.viaRandomKey())
    .apply(ParDo.of(new DoFn<Integer, Integer>() {
        @ProcessElement
        public void apply(@Element Integer element, OutputReceiver<Integer> outputReceiver) {
            try {
                // simulate an RPC call of 10 ms
                Thread.sleep(10);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            outputReceiver.output(element);
        }
    }));
PipelineResult result = p.run();
result.waitUntilFinish();
I am running with --runner=SparkRunner --sparkMaster=local[8], but only 1 thread is used after the reshuffle.
Why is the Reshuffle not working?
If I replace the reshuffle with this:
.apply(MapElements.into(kvs(integers(), integers())).via(e -> KV.of(e % 8, e)))
.apply(GroupByKey.create())
.apply(Values.create())
.apply(Flatten.iterables())
then I get 8 threads running.
BR, Rafael.

It looks like Reshuffle on Beam on Spark boils down to the implementation at
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/GroupCombineFunctions.java#L191
I wonder if in this case both rdd.context().defaultParallelism() and rdd.getNumPartitions() are 1. I've filed https://issues.apache.org/jira/browse/BEAM-10834 to investigate.
In the meantime, you can use GroupByKey to get the desired parallelism, as you've indicated. (If you don't literally have integers, you could try using the hash of your element, a Math.random() value, or even an incrementing counter as the key.)
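For reference, a minimal sketch of that workaround with a random key instead of e % 8 (the key range of 8 is an arbitrary choice matching local[8], and input stands for the PCollection coming out of the first ParDo):
```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

import static org.apache.beam.sdk.values.TypeDescriptors.integers;
import static org.apache.beam.sdk.values.TypeDescriptors.kvs;

// Manual "reshuffle": key each element randomly, force a grouping
// (which repartitions on Spark), then throw the keys away again.
PCollection<Integer> reshuffled = input
    .apply(MapElements.into(kvs(integers(), integers()))
        .via(e -> KV.of(ThreadLocalRandom.current().nextInt(8), e)))
    .apply(GroupByKey.create())
    .apply(Values.create())
    .apply(Flatten.iterables());
```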

Related

Multithreaded Kafka Consumer not processing all the partitions in parallel

I have created a multithreaded Kafka consumer in which one thread is assigned to each partition (I have 100 partitions in total). I followed the https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example example.
Below is the init method of my consumer.
consumer = kafka.consumer.Consumer.createJavaConsumerConnector(createConsumerConfig());
System.out.println("Kafka Consumer initialized.");
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topicName, 100);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topicName);
executor = Executors.newFixedThreadPool(100);
In the above init method, I got the list of Kafka streams (100 in total), each of which should be connected to one of the partitions (which is happening as expected).
Then I submitted each of the streams to a different thread using the snippet below.
public Object call() {
    for (final KafkaStream stream : streams) {
        executor.execute(new StreamWiseConsumer(stream));
    }
    return true;
}
Below is the StreamWiseConsumer class.
public class StreamWiseConsumer extends Thread {
    ConsumerIterator<byte[], byte[]> consumerIterator;
    private KafkaStream m_stream;
    private volatile boolean interrupted; // stop flag, presumably set elsewhere

    public StreamWiseConsumer(ConsumerIterator<byte[], byte[]> consumerIterator) {
        this.consumerIterator = consumerIterator;
    }

    public StreamWiseConsumer(KafkaStream kafkaStream) {
        this.m_stream = kafkaStream;
    }

    @Override
    public void run() {
        ConsumerIterator<byte[], byte[]> consumerIterator = m_stream.iterator();
        while (!Thread.currentThread().isInterrupted() && !interrupted) {
            try {
                if (consumerIterator.hasNext()) {
                    String reqId = UUID.randomUUID().toString();
                    System.out.println(reqId + " : Event received by threadId : " + Thread.currentThread().getId());
                    MessageAndMetadata<byte[], byte[]> messageAndMetaData = consumerIterator.next();
                    byte[] keyBytes = messageAndMetaData.key();
                    String key = null;
                    if (keyBytes != null) {
                        key = new String(keyBytes);
                    }
                    byte[] eventBytes = messageAndMetaData.message();
                    if (eventBytes == null) {
                        System.out.println("Topic: No event fetched for transaction Id:" + key);
                        continue;
                    }
                    String event = new String(eventBytes).trim();
                    // Some processing code
                    System.out.println(reqId + " : Processing completed for threadId = " + Thread.currentThread().getId());
                    consumer.commitOffsets();
                }
            } catch (Exception ex) {
            }
        }
    }
}
Ideally, it should start processing from all 100 partitions in parallel. But it picks some random number of events from one of the threads, processes them, and then some other thread starts processing from another partition. It looks like sequential processing, just on different threads. I was expecting processing to happen from all 100 threads. Am I missing something here?
Please find the log links below:
https://drive.google.com/file/d/14b7gqPmwUrzUWewsdhnW8q01T_cQ30ES/view?usp=sharing
https://drive.google.com/file/d/1PO_IEsOJFQuerW0y-M9wRUB-1YJuewhF/view?usp=sharing
I doubt this is the right approach for vertically scaling Kafka streams.
Kafka Streams inherently supports multi-threaded consumption.
Increase the number of threads used for processing by using the num.stream.threads configuration.
If you want 100 threads to process the 100 partitions, set num.stream.threads to 100.
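For illustration, a minimal Kafka Streams sketch of that configuration (the application id, bootstrap server, topic name, and processing stub are placeholders, not taken from the question):
```java
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class HundredThreadConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hundred-partition-consumer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // One stream thread per partition: 100 partitions -> 100 threads.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 100);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<byte[], byte[]>stream("my-topic")
               .foreach((key, value) -> {
                   // processing code goes here
               });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```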

Producer-Consumer: Parallel Programming

My question is really simple: is this program valid as a simulation of the producer-consumer problem?
public class ProducerConsumer {
    public static void main(String[] args) {
        Consumers c = new Consumers(false, null);
        Producer p = new Producer(true, c);
        c.p = p;
        p.start();
        c.start();
    }
}

class Consumers extends Thread {
    boolean hungry; // I want to eat
    Producer p;

    public Consumers(boolean hungry, Producer p) {
        this.hungry = hungry;
        this.p = p;
    }

    public void run() {
        while (true) {
            // While the producer wants to produce, don't go
            while (p.nice == true) {
                // Simulation of the waiting, to check that it doesn't wait and
                // eat at the same time or hit any bad interleavings
                System.out.println("Consumer doesn't eat");
                try {
                    sleep(500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
            for (int i = 0; i < 3; i++) {
                try {
                    sleep(1000);
                    // Because the consumer eats, the producer gets bored and
                    // wants to produce; that's the meaning of nice.
                    // This line makes the producer automatically wait in the
                    // while loop as soon as it has finished producing.
                    p.nice = true;
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                System.out.println("Consumer eat");
            }
            hungry = false;
            System.out.println("\nConsumer doesn't eat anymore\n");
        }
    }
}
class Producer extends Thread {
    boolean nice;
    Consumers c;

    public Producer(boolean nice, Consumers c) {
        this.nice = nice;
        this.c = c;
    }

    public void run() {
        while (true) {
            /**
             * I begin with the producer, so the producer doesn't enter the
             * loop, because no food has been produced and hungry is
             * exceptionally false; that's how this program works,
             * so the first time the producer doesn't enter the loop.
             */
            while (c.hungry == true) {
                try {
                    sleep(500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                System.out.println("Producer doesn't produce");
            }
            /**
             * While the consumer waits in the while loop of its run method
             * (which means that nice is true), the producer produces, and
             * during the production the consumer becomes hungry, which makes
             * the loop "enterable" for the producer. The advantage of this is
             * that the producer already knows that it has to go away after
             * producing; the consumer doesn't need to tell it.
             * Produce becomes true, and it has no effect for the first round.
             */
            for (int i = 0; i < 3; i++) {
                try {
                    sleep(1000);
                    c.hungry = true;
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                System.out.println("Producer produce");
            }
            /**
             * After a while the producer has produced, the consumer is still
             * in the loop, so we can tell it it can go, but we have to make
             * sure that the producer doesn't pass the loop before the
             * consumer goes out, because setting produce back to true would
             * leave the consumer stuck again; that's the role of the
             * c.hungry in the for loop: because the producer knows it has
             * some client, it directly enters the loop and so can't
             * starve the client.
             */
            System.out.println("\nProducer doesn't produce anymore\n");
            nice = false;
        }
    }
}
I didn't use any synchronization, wait, or notify, so for a parallel programming problem this seems very strange, but when I run it there aren't any deadlocks, starvation, or bad interleavings: the producer produces, then stops, the consumer eats, then stops, and so on, as many times as I wanted.
Have I cheated somewhere?
Thanks!
First of all, be careful with the naming: "Consumers" is misleading, since you are only simulating a lone consumer. nice could also be renamed to "producing".
Secondly, you're using a while(condition) sleep loop, which is basically a less efficient, unprotected version of a semaphore wait, so you did use a form of wait.
E.g.
while (p.nice == true) {
    System.out.println("Consumer doesn't eat");
    try {
        sleep(500);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
is your P()
System.out.println("\nProducer doesn't produce anymore\n");
nice = false;
is your V()
This method, however, is both inefficient (the waiting thread is either busy-waiting or sleeping for a moment when it could already go) and unprotected (because there is no protection against simultaneous access to nice and hungry, you won't be able to extend this program with more Consumers or Producers).
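For comparison, a minimal sketch of the same back-and-forth built on real P()/V() operations via java.util.concurrent.Semaphore, so each thread blocks instead of polling (an illustration of the pattern, not a rewrite of your exact program):
```java
import java.util.concurrent.Semaphore;

public class SemaphoreHandoff {
    // foodReady starts at 0 and consumed at 1, so the producer goes first.
    static final Semaphore foodReady = new Semaphore(0);
    static final Semaphore consumed = new Semaphore(1);

    public static void main(String[] args) {
        new Thread(() -> {
            while (true) {
                acquire(consumed);        // P(): wait until the consumer is done
                System.out.println("Producer produce");
                foodReady.release();      // V(): signal that food is ready
            }
        }).start();

        new Thread(() -> {
            while (true) {
                acquire(foodReady);       // P(): wait until food is ready
                System.out.println("Consumer eat");
                consumed.release();       // V(): tell the producer to go again
            }
        }).start();
    }

    private static void acquire(Semaphore s) {
        try {
            s.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```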
Hope this helps.

How does RecursiveAction work?

I downloaded some existing code from the internet. I ran it with a few modifications. In one scenario, I did not get what I was looking for. Here is the code -
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class MyRecursiveAction extends RecursiveAction {
    private long workload = 0;

    public MyRecursiveAction(long workload) {
        this.workload = workload;
    }

    @Override
    protected void compute() {
        if (this.workload > 16) {
            System.out.println("Splitting workload :: " + this.workload);
            List<MyRecursiveAction> subtasks = new ArrayList<MyRecursiveAction>();
            subtasks.addAll(createSubtasks());
            for (RecursiveAction subtask : subtasks) {
                subtask.fork();
            }
        } else {
            System.out.println("Doing work myself1 " + this.workload);
            try {
                Thread.sleep(1000L);
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            System.out.println("Done it ya " + this.workload);
        }
    }

    private List<MyRecursiveAction> createSubtasks() {
        List<MyRecursiveAction> subTasks = new ArrayList<>();
        MyRecursiveAction subtask1 = new MyRecursiveAction(this.workload / 2);
        MyRecursiveAction subtask2 = new MyRecursiveAction(this.workload / 2);
        subTasks.add(subtask1);
        subTasks.add(subtask2);
        return subTasks;
    }

    public static void main(String[] args) {
        MyRecursiveAction myRecursiveAction = new MyRecursiveAction(24);
        ForkJoinPool forkJoinPool = new ForkJoinPool(4);
        forkJoinPool.invoke(myRecursiveAction);
    }
}
Check the following excerpt -
System.out.println("Doing work myself1 " + this.workload);
try {
Thread.sleep(1000L);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Done it ya " + this.workload);
I added a sleep of 1 second and then printed another statement. However, when I run the code, I don't see that statement getting printed, and I don't understand why. Why does it not get printed? In fact the result of the execution is -
Splitting workload :: 24
Doing work myself1 12
Doing work myself1 12
I was expecting the following line as well - "Done it ya".
Make workload static and volatile:
private static volatile long workload = 0;
Lose the this.workload for just workload.
Alter the if statement to:
if(workload > 0) {
Then you will get to "Done it ya".
I have found the reason why the last line was not getting printed: fork works asynchronously, so it is altogether a different thread that sleeps for some time. In asynchronous programming, the main thread does not wait for the response to come back unless we add some construct in code to make it do so. In this case, by the time the worker thread wakes up after 1 second, the main thread is already over.
To force the main thread to wait for the execution of the other threads, we need to use join.
ForkJoinTask.join(): this method blocks until the result of the computation is done.
So if I add the following block
for (RecursiveAction subtask : subtasks) {
    subtask.join();
}
the main thread waits and we get all the expected lines printed on the console.
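Putting it together, a sketch of the corrected compute() (alternatively, the static helper ForkJoinTask.invokeAll(...) forks and joins the whole collection in one call):
```java
@Override
protected void compute() {
    if (this.workload > 16) {
        System.out.println("Splitting workload :: " + this.workload);
        List<MyRecursiveAction> subtasks = createSubtasks();
        for (MyRecursiveAction subtask : subtasks) {
            subtask.fork();               // schedule asynchronously
        }
        for (MyRecursiveAction subtask : subtasks) {
            subtask.join();               // block until this subtask is done
        }
        // Equivalent: invokeAll(createSubtasks());
    } else {
        System.out.println("Doing work myself1 " + this.workload);
        try {
            Thread.sleep(1000L);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.out.println("Done it ya " + this.workload);
    }
}
```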

Stopping Spark streaming context on encountering a certain message on Kafka

In my Spark Streaming application I am reading data from a certain Kafka topic. While reading from the topic, whenever I encounter a certain message (for example: "poison") I want to stop the streaming. Currently I am achieving this using the following code (jsc is an instance of JavaStreamingContext and directStream is an instance of JavaPairInputDStream):
LongAccumulator poisonNotifier = sc.longAccumulator("poisonNotifier");
directStream.foreachRDD(rdd -> {
    RDD<Row> rows = rdd.values().map(value -> {
        if (value.equals("poison")) {
            poisonNotifier.add(1);
        } else {
            ...
        }
        return row;
    }).rdd();
});
jsc.start();
ExecutorService poisonMonitor = Executors.newSingleThreadExecutor();
poisonMonitor.execute(() -> {
    while (true) {
        if (poisonNotifier.value() > 0) {
            jsc.stop(false, true);
            break;
        }
    }
});
try {
    jsc.awaitTermination();
} catch (InterruptedException e) {
    e.printStackTrace();
}
poisonMonitor.shutdown();
Although this approach works, it doesn't sound like the right approach to me. Is there a better (cleaner) way to achieve the same?

how to prevent reading when hazelcast map is locked

There is a lot of data to be put into a Hazelcast map, and I want to prevent others from reading from the map while the data is being put into it. Is there any way to achieve this?
For example:
map a = map(1,000,000,000) // a has 1,000,000,000 elements
map b = map(2,000) // b has 2,000 elements
I want to put all of b into a. The elements of b should only be accessible after all of them have been put into map a; as long as the elements of map b haven't all been put into map a, they must not be accessible.
use case:
map a ={1,2,3,4,5}
map b ={a,b,c,d,e}
print a // result {1,2,3,4,5}
foreach item in b
a.put item
print a // result {1,2,3,4,5}
end foreach
print a //result {1,2,3,4,5,a,b,c,d,e}
I want to merge these two maps, while map b's elements cannot be accessed via map a before the merging has finished.
My solution:
Thanks to all the people for their help.
After reading the Hazelcast manual, I chose TransactionalMap to resolve this problem.
TransactionalMap has READ_COMMITTED isolation: it can suspend threads that read map(1) while the transaction is updating map(1).
```java
static Runnable tx = new Runnable() {
    @Override
    public void run() {
        try {
            logger.info("start transaction...");
            TransactionContext txCxt = hz.newTransactionContext();
            txCxt.beginTransaction();
            TransactionalMap<Object, Object> map = txCxt.getMap("map");
            try {
                logger.info("before put map(1)");
                Thread.sleep(300);
                map.put("1", "1"); // reader1 is blocked
                logger.info("after put map(1)");
                Thread.sleep(500);
                map.put("2", "2"); // reader2 is blocked
                logger.info("after put map(2)");
                Thread.sleep(500);
                txCxt.commitTransaction();
                logger.info("transaction committed");
            } catch (RuntimeException t) {
                txCxt.rollbackTransaction();
                throw t;
            }
            Thread.sleep(500);
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            logger.info("Finished testmap size:{}, testmap(1):{}, testmap(2):{} ",
                    testmap.size(), testmap.get("1"), testmap.get("2"));
            Hazelcast.shutdownAll();
            logger.info("system exit.");
            System.exit(0);
        }
    }
};
```
What's your motivation / use-case? You can use transactions, but that could have a bad impact on performance. Alternatively you could use manual locking - see ILock.
However both these techniques should be used as a last-resort - when you have no chance to design your application differently.
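If you do go the ILock route, a minimal sketch might look like this (Hazelcast 3.x API; the map contents and lock name are placeholders, and note that readers must take the same lock, since ILock by itself does not block plain map.get() calls):
```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;
import com.hazelcast.core.IMap;

import java.util.HashMap;
import java.util.Map;

public class LockedMerge {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> a = hz.getMap("a");
        Map<String, String> b = new HashMap<>();      // placeholder source data
        ILock mergeLock = hz.getLock("a-merge-lock"); // arbitrary lock name

        // Writer: hold the lock for the entire merge.
        mergeLock.lock();
        try {
            a.putAll(b);
        } finally {
            mergeLock.unlock();
        }

        // Readers cooperate by taking the same lock around their reads;
        // otherwise their a.get(...) calls proceed unblocked.
        mergeLock.lock();
        try {
            a.get("someKey");
        } finally {
            mergeLock.unlock();
        }
    }
}
```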
One way to achieve this is by locking the segments in map b while adding to it. Once pushing the entries to map a is complete, you can unlock the segments.
There will be performance implications with this method, though, as it requires the extra step of locking/unlocking.
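As a sketch of what that per-key locking could look like with IMap.lock()/IMap.unlock() (assuming both a and b are IMaps; an illustration of the idea, not a tested recipe):
```java
// Lock every key of b up front so readers that honor the locks wait,
// copy the entries into a, then release the locks.
for (Object key : b.keySet()) {
    b.lock(key);
}
try {
    a.putAll(b);
} finally {
    for (Object key : b.keySet()) {
        b.unlock(key);
    }
}
// As with ILock, a plain b.get(key) is not necessarily blocked by the
// lock; readers should use b.lock(key)/b.unlock(key) (or tryLock)
// around their own accesses for this to be effective.
```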
