Spark FileStreaming issue - apache-spark

I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1):
public class DStreamExample {
public static void main(final String[] args) {
final SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("SparkJob");
sparkConf.setMaster("local[4]"); // for local
final JavaSparkContext sc = new JavaSparkContext(sparkConf);
final JavaStreamingContext ssc = new JavaStreamingContext(sc,
new Duration(2000));
final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");
lines.print();
ssc.start();
ssc.awaitTermination();
}
}
When I run this code on a single file or a directory, it does not print anything from the file. In the logs I can see it constantly polling, but nothing is printed. I also tried moving a file into the directory while the program was running.
Is there something I am missing? I tried applying a map function on the lines DStream, but that does not work either.

The textFileStream API is not supposed to read existing directory content; instead, its purpose is to monitor the given Hadoop-compatible filesystem path for changes. Files must be written into the monitored location by "moving" them from another location within the same file system.
In short, you are subscribing to directory changes and will receive the content of files that newly appear within the monitored location, in whatever state they are in at the moment of the monitoring snapshot (every 2000 ms in your case). Any further updates to those files will not reach the stream; only directory updates (new files) will.
One way to emulate updates is to create a new file during your monitoring session:
import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.io.File;
import java.io.IOException;
import java.util.List;
public class DStreamExample {
public static void main(final String[] args) throws IOException {
final SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("SparkJob");
sparkConf.setMaster("local[4]"); // for local
final JavaSparkContext sc = new JavaSparkContext(sparkConf);
final JavaStreamingContext ssc = new JavaStreamingContext(sc,
new Duration(2000));
final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");
// spawn the thread which will create new file within the monitored directory soon
Runnable r = () -> {
try {
Thread.sleep(5000);
} catch (InterruptedException e) {
e.printStackTrace();
}
try {
FileUtils.write(new File("/opt/test/newfile1"), "whatever");
} catch (IOException e) {
e.printStackTrace();
}
};
new Thread(r).start();
lines.foreachRDD((Function<JavaRDD<String>, Void>) rdd -> {
List<String> lines1 = rdd.collect();
lines1.stream().forEach(l -> System.out.println(l));
return null;
});
ssc.start();
ssc.awaitTermination();
}
}
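For the "move a file into the monitored directory" variant mentioned above, a minimal sketch could look like the following (the staging path is made up; the move should stay within the same file system so it is effectively atomic):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveIntoMonitoredDir {
    public static void main(String[] args) throws IOException {
        // Write the file outside the monitored directory first...
        Path staging = Files.write(Paths.get("/opt/staging/newfile2"), "whatever".getBytes());
        // ...then move it into /opt/test/ so Spark sees a complete new file appear.
        Files.move(staging, Paths.get("/opt/test/newfile2"), StandardCopyOption.ATOMIC_MOVE);
    }
}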

Related

How to tail a file with rx-java?

Using the rx-java package, how can I tail a file?
code:
package tail;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.logging.Logger;
public class FileUtils {
public FileUtils() {
}
public void readFromInputStream(InputStream inputStream)
throws IOException {
StringBuilder resultStringBuilder = new StringBuilder();
try (BufferedReader br
= new BufferedReader(new InputStreamReader(inputStream))) {
String line;
while ((line = br.readLine()) != null) {
resultStringBuilder.append(line).append("\n");
}
}
System.out.println(resultStringBuilder);
}
}
I want to parse each line as it's written to the file. First, I need an equivalent of tail -f foo.log in Java.
I'm not sure about an RxJava-specific mechanism, but you can take a general Java solution and combine it with RxJava. This of course depends on how you want to be notified.
One basic option is to tail the file using an open InputStream (or a stream wrapper or reader class).
From a running thread, you can block on the file with InputStream.read or poll for new bytes with InputStream.available. In RxJava you can run that thread on the IO scheduler. After reading an entire line, just notify your observers from that thread.
In core Java, it could look something like:
class ObserveTask implements Runnable {
    private File file;
    private Listener listener; // callback interface that receives each new line

    @Override
    public void run() {
        try (InputStream stream = new FileInputStream(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
            while (!Thread.interrupted()) {
                // readLine returns null at end of file, so a real implementation
                // would wait and retry here instead of pushing nulls to the listener
                String line = reader.readLine();
                listener.onNewLine(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Now just run the task:
Scheduler scheduler = Schedulers.newThread();
Scheduler.Worker worker = scheduler.createWorker();
worker.schedule(new ObserveTask());
Of course this is just a basic example.
You can also split the task into smaller runs, where each run checks for new bytes while the stream remains open between runs. Of course this is risky, since leaving a stream open like that might end in a resource leak.
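If you would rather expose the lines as an RxJava stream than go through a listener interface, a minimal sketch could look like this (assuming RxJava 2 on the classpath; the file name and the 200 ms poll interval are arbitrary):
import io.reactivex.Observable;
import io.reactivex.schedulers.Schedulers;

import java.io.BufferedReader;
import java.io.FileReader;

public class RxTail {

    // Wraps the polling loop in an Observable<String> that emits each new line.
    public static Observable<String> tail(String path) {
        return Observable.<String>create(emitter -> {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                while (!emitter.isDisposed()) {
                    String line = reader.readLine();
                    if (line == null) {
                        Thread.sleep(200);    // no new data yet, wait and poll again
                    } else {
                        emitter.onNext(line); // push each complete line downstream
                    }
                }
            }
        }).subscribeOn(Schedulers.io());      // run the blocking reads on the IO scheduler
    }

    public static void main(String[] args) throws InterruptedException {
        tail("foo.log").subscribe(System.out::println);
        Thread.sleep(60_000);                 // keep the JVM alive while tailing
    }
}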
A more advanced solution could involve using a java.nio.file.WatchService.
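A rough sketch of that direction, waking up only when the watched file changes (directory and file name are made up), could look like:
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class WatchTail {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir = Paths.get("/var/log"); // directory that contains foo.log (made-up path)
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take(); // blocks until something in the directory changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    if ("foo.log".equals(event.context().toString())) {
                        // wake the reading thread here and let it drain the newly appended lines
                        System.out.println("foo.log was modified");
                    }
                }
                key.reset(); // re-arm the key so further events are delivered
            }
        }
    }
}
The take() call blocks without burning CPU, which avoids the polling loop of the basic approach.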

Apache Spark cannot read Kafka message content

I am trying to create an Apache Spark job to consume Kafka messages submitted to a topic. I submit messages to the topic using kafka-console-producer as below.
./kafka-console-producer.sh --broker-list kafka1:9092 --topic my-own-topic
To read messages I am using the spark-streaming-kafka-0-10_2.11 library. With the library I manage to read the total count of the messages received on the topic, but I cannot read the ConsumerRecord objects in the stream; when I try to, the entire application blocks and nothing is printed to the console. Note that I am running Kafka, Zookeeper and Spark in Docker containers. Help would be greatly appreciated.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;
public class SparkKafkaStreamingJDBCExample {
public static void main(String[] args) {
// Start a spark instance and get a context
SparkConf conf =
new SparkConf().setAppName("Study Spark").setMaster("spark://spark-master:7077");
// Setup a streaming context.
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(3));
// Create a map of Kafka params
Map<String, Object> kafkaParams = new HashMap<String, Object>();
// List of Kafka brokers to listen to.
kafkaParams.put("bootstrap.servers", "kafka1:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
// Do you want to start from the earliest record or the latest?
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", true);
// List of topics to listen to.
Collection<String> topics = Arrays.asList("my-own-topic");
// Create a Spark DStream with the kafka topics.
final JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
System.out.println("Study Spark Example Starting ....");
stream.foreachRDD(rdd -> {
if (rdd.isEmpty()) {
System.out.println("RDD Empty " + rdd.count());
return;
} else {
System.out.println("RDD not empty " + rdd.count());
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
System.out.println("Partition Id " + TaskContext.getPartitionId());
OffsetRange o = offsetRanges[TaskContext.getPartitionId()];
System.out.println("Topic " + o.topic());
System.out.println("Creating RDD !!!");
JavaRDD<ConsumerRecord<String, String>> r =
KafkaUtils.createRDD(streamingContext.sparkContext(), kafkaParams, offsetRanges,
LocationStrategies.PreferConsistent());
System.out.println("Count " + r.count());
//Application stuck from here onwards ...
ConsumerRecord<String, String> first = r.first();
System.out.println("First taken");
System.out.println("First value " + first.value());
}
});
System.out.println("Stream context starting ...");
// Start streaming.
streamingContext.start();
System.out.println("Stream context started ...");
try {
System.out.println("Stream context await termination ...");
streamingContext.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Sample output is given below as well:
Study Spark Example Starting ....
Stream context starting ...
Stream context started ...
Stream context await termination ...
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD not empty 3
Partition Id 0
Topic my-own-topic
Creating RDD !!!

Cassandra cluster is not scaling: 3 nodes are even a little faster than 6 nodes (code and data provided)

I am using DataStax Enterprise 4.8 for testing purposes in my bachelor thesis. I am loading weather data into the cluster (about 33 million rows).
The data looks something like the following:
//id;unix timestamp; validity; station info; temp in °C; humidity in %
3;1950040101;5;24; 5.7000;83.0000
3;1950040102;5;24; 5.6000;83.0000
3;1950040103;5;24; 5.5000;83.0000
I know my data model is not very clean (I use decimal for the timestamp but I just wanted to try it this way).
CREATE TABLE temp (
    id int,
    timestamp decimal,
    validity decimal,
    structure decimal,
    temperature float,
    humidity float,
    PRIMARY KEY ((id), timestamp)
);
I roughly based it on an article on the datastax website:
https://academy.datastax.com/resources/getting-started-time-series-data-modeling
The insertion is based on the often-mentioned article on lostechies:
https://lostechies.com/ryansvihla/2016/04/29/cassandra-batch-loading-without-the-batch%E2%80%8A-%E2%80%8Athe-nuanced-edition/
This is my insertion code:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.math.BigDecimal;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.extras.codecs.jdk8.InstantCodec;
import com.google.common.base.Stopwatch;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;
public class BulkLoader {
private final int threads;
private final String[] contactHosts;
private final Cluster cluster;
private final Session session;
private final ExecutorService executor;
public BulkLoader(int threads, String... contactHosts) {
this.threads = threads;
this.contactHosts = contactHosts;
this.cluster = Cluster.builder().addContactPoints(contactHosts).build();
cluster.getConfiguration().getCodecRegistry()
.register(InstantCodec.instance);
session = cluster.newSession();
// fixed thread pool that closes on app exit
executor = MoreExecutors
.getExitingExecutorService((ThreadPoolExecutor) Executors
.newFixedThreadPool(threads));
}
public static class IngestCallback implements FutureCallback<ResultSet> {
public void onSuccess(ResultSet result) {
}
public void onFailure(Throwable t) {
throw new RuntimeException(t);
}
}
public void ingest(Iterator<Object[]> boundItemsIterator, String insertCQL)
throws InterruptedException {
final PreparedStatement statement = session.prepare(insertCQL);
while (boundItemsIterator.hasNext()) {
BoundStatement boundStatement = statement.bind(boundItemsIterator
.next());
boundStatement.setConsistencyLevel(ConsistencyLevel.QUORUM);
ResultSetFuture future = session.executeAsync(boundStatement);
Futures.addCallback(future, new IngestCallback(), executor);
}
}
public void stop() {
session.close();
cluster.close();
}
public static List<Object[]> readCSV(File csv) {
BufferedReader fileReader = null;
List<Object[]> result = new LinkedList<Object[]>();
try {
fileReader = new BufferedReader(new FileReader(csv));
String line = "";
while ((line = fileReader.readLine()) != null) {
String[] tokens = line.split(";");
if (tokens.length < 6) {
System.out.println("Unvollständig");
continue;
}
Object[] tmp = new Object[6];
tmp[0] = (int) Integer.parseInt(tokens[0]);
tmp[1] = new BigDecimal(Integer.parseInt(tokens[1]));
tmp[2] = new BigDecimal(Integer.parseInt(tokens[2]));
tmp[3] = new BigDecimal(Integer.parseInt(tokens[2]));
tmp[4] = (float) Float.parseFloat(tokens[4]);
tmp[5] = (float) Float.parseFloat(tokens[5]);
result.add(tmp);
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
fileReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return result;
}
public static void main(String[] args) {
Stopwatch watch = Stopwatch.createStarted();
File folder = new File(
"C:/VirtualMachines/Kiosk/BachelorarbeitStraubinger/workspace/bulk/src/main/resources");
List<Object[]> data = new LinkedList<Object[]>();
BulkLoader loader = new BulkLoader(16, "10.2.57.38", "10.2.57.37",
"10.2.57.36", "10.2.57.35", "10.2.57.34", "10.2.57.33");
int cnt = 0;
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles) {
if (file.isFile() && file.getName().contains(".th")) {
data = readCSV(file);
cnt += data.size();
try {
loader.ingest(
data.iterator(),
"INSERT INTO wheather.temp (id, timestamp, validity, structure, temperature, humidity) VALUES(?,?,?,?,?,?)");
} catch (InterruptedException e) {
e.printStackTrace();
} finally {
System.out.println(file.getName()
+ " -> Datasets importet: " + cnt);
}
}
}
System.out.println("total time seconds = "
+ watch.elapsed(TimeUnit.SECONDS));
watch.stop();
loader.stop();
}
}
The replication factor is 3 and I run the test on 6 or 3 nodes, with vnodes enabled and num_tokens = 256.
I get roughly the same insert times when running it on either cluster. Any ideas why that is?
It is likely that you're maxing out the client application / client server. If you're reading from a static file, you may benefit from breaking it up into a few pieces and running them in parallel, or even looking at Brian Hess' loader (https://github.com/brianmhess/cassandra-loader) or the real Cassandra bulk loader (http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated), which turns the data into a series of SSTables and streams them in directly. Both are likely faster than your existing code.
Physics.
You're probably maxing out the throughput your app is capable of. Normally the answer would be to have multiple clients/app servers, but it looks like you are reading from a CSV. I suggest either cutting the CSV up into pieces and running multiple instances of your app, or generating fake data and running multiple instances against that.
Edit: I also think it's worth noting that with a data model like that, a payload size that small, and proper hardware, I'd imagine each node could be capable of 15-20K inserts/second (not accounting for node density/compaction).
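As a rough illustration of the "multiple clients" suggestion, the existing BulkLoader could be driven from a small thread pool, one task per pre-split CSV chunk (the chunk directory, contact points and pool size below are made up):
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIngest {
    public static void main(String[] args) throws InterruptedException {
        // One shared BulkLoader (and therefore one Cluster/Session) for all worker threads.
        BulkLoader loader = new BulkLoader(16, "10.2.57.38", "10.2.57.37", "10.2.57.36");
        ExecutorService pool = Executors.newFixedThreadPool(4); // one worker per CSV chunk

        File[] chunks = new File("/data/chunks").listFiles();   // pre-split CSV pieces
        for (File chunk : chunks) {
            pool.submit(() -> {
                try {
                    loader.ingest(BulkLoader.readCSV(chunk).iterator(),
                            "INSERT INTO wheather.temp (id, timestamp, validity, structure, temperature, humidity) "
                                    + "VALUES (?,?,?,?,?,?)");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        loader.stop();
    }
}
Whether this actually helps depends on whether a single client machine's CPU or network is the bottleneck; if it is, the chunks need to run on separate machines instead.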

Commons Configuration2 ReloadingFileBasedConfiguration

I am trying to implement Apache Commons Configuration 2 in my codebase:
import java.io.File;
import java.util.concurrent.TimeUnit;
import org.apache.commons.configuration2.PropertiesConfiguration;
import org.apache.commons.configuration2.builder.ConfigurationBuilderEvent;
import org.apache.commons.configuration2.builder.ReloadingFileBasedConfigurationBuilder;
import org.apache.commons.configuration2.builder.fluent.Parameters;
import org.apache.commons.configuration2.convert.DefaultListDelimiterHandler;
import org.apache.commons.configuration2.event.EventListener;
import org.apache.commons.configuration2.ex.ConfigurationException;
import org.apache.commons.configuration2.reloading.PeriodicReloadingTrigger;
import org.apache.commons.configuration2.CompositeConfiguration;
public class Test {
private static final long DELAY_MILLIS = 10 * 60 * 5;
public static void main(String[] args) {
// TODO Auto-generated method stub
CompositeConfiguration compositeConfiguration = new CompositeConfiguration();
PropertiesConfiguration props = null;
try {
props = initPropertiesConfiguration(new File("/tmp/DEV.properties"));
} catch (ConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
compositeConfiguration.addConfiguration( props );
compositeConfiguration.addEventListener(ConfigurationBuilderEvent.ANY,
new EventListener<ConfigurationBuilderEvent>()
{
@Override
public void onEvent(ConfigurationBuilderEvent event)
{
System.out.println("Event:" + event);
}
});
System.out.println(compositeConfiguration.getString("property1"));
try {
Thread.sleep(14*1000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// Have a script which changes the value of property1 in DEV.properties
System.out.println(compositeConfiguration.getString("property1"));
}
protected static PropertiesConfiguration initPropertiesConfiguration(File propsFile) throws ConfigurationException {
if(propsFile.exists()) {
final ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration> builder =
new ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration>(PropertiesConfiguration.class)
.configure(new Parameters().fileBased()
.setFile(propsFile)
.setReloadingRefreshDelay(DELAY_MILLIS)
.setThrowExceptionOnMissing(false)
.setListDelimiterHandler(new DefaultListDelimiterHandler(';')));
final PropertiesConfiguration propsConfiguration = builder.getConfiguration();
PeriodicReloadingTrigger trigger = new PeriodicReloadingTrigger(builder.getReloadingController(),
null, 1, TimeUnit.SECONDS);
trigger.start();
return propsConfiguration;
} else {
return new PropertiesConfiguration();
}
}
}
Here is sample code that I am using to check whether the automatic reloading works. However, when the underlying property file is updated, the configuration does not reflect it.
As per the documentation:
One important point to keep in mind when using this approach to reloading is that reloads are only functional if the builder is used as central component for accessing configuration data. The configuration instance obtained from the builder will not change automagically! So if an application fetches a configuration object from the builder at startup and then uses it throughout its life time, changes on the external configuration file become never visible. The correct approach is to keep a reference to the builder centrally and obtain the configuration from there every time configuration data is needed.
https://commons.apache.org/proper/commons-configuration/userguide/howto_reloading.html#Reloading_File-based_Configurations
This is different from how the old implementation behaved.
I was able to successfully execute your sample code by making two changes:
Make the builder available globally and access the configuration from the builder:
System.out.println(builder.getConfiguration().getString("property1"));
Add the listener to the builder:
builder.addEventListener(ConfigurationBuilderEvent.ANY, new EventListener<ConfigurationBuilderEvent>() {
    @Override
    public void onEvent(ConfigurationBuilderEvent event) {
        System.out.println("Event:" + event);
    }
});
Posting my sample program, where I was able to successfully demonstrate it
import java.io.File;
import java.util.concurrent.TimeUnit;
import org.apache.commons.configuration2.PropertiesConfiguration;
import org.apache.commons.configuration2.builder.ConfigurationBuilderEvent;
import org.apache.commons.configuration2.builder.ReloadingFileBasedConfigurationBuilder;
import org.apache.commons.configuration2.builder.fluent.Parameters;
import org.apache.commons.configuration2.event.EventListener;
import org.apache.commons.configuration2.reloading.PeriodicReloadingTrigger;
public class TestDynamicProps {
public static void main(String[] args) throws Exception {
Parameters params = new Parameters();
ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration> builder =
new ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration>(PropertiesConfiguration.class)
.configure(params.fileBased()
.setFile(new File("src/main/resources/override.properties")));
PeriodicReloadingTrigger trigger = new PeriodicReloadingTrigger(builder.getReloadingController(),
null, 1, TimeUnit.SECONDS);
trigger.start();
builder.addEventListener(ConfigurationBuilderEvent.ANY, new EventListener<ConfigurationBuilderEvent>() {
public void onEvent(ConfigurationBuilderEvent event) {
System.out.println("Event:" + event);
}
});
while (true) {
Thread.sleep(1000);
System.out.println(builder.getConfiguration().getString("property1"));
}
}
}
The problem with your implementation is that the reloading is done on the ReloadingFileBasedConfigurationBuilder object, and the reloaded values never reach the PropertiesConfiguration object you obtained from it once at startup.

Spark Streaming for evaluating rules

I am planning to use Spark Streaming to evaluate rules in real time instead of doing it from a database.
I have data ingested into a Kafka topic. The ingested data is about users and their actions. I have rules like identifying users doing more than 100 transactions in 10 minutes, or identifying users doing 0 transactions in the last 5 minutes.
The time interval (window) and the rules vary to some extent, and I have around 100 such rules. Since we specify the window when we create a stream, do we need to create as many streams as there are rules? I am not sure whether this is the right approach and whether Spark Streaming fits this use case. As I am new to Spark, I am looking for inputs to solve this efficiently.
Please find the sample code below with the rules. These rules will be configured by an analyst; there can be 50 rules, and the time period can vary from 1 minute to 1 hour based on the analyst's requirements. If I have 10 different timelines, do I need to create 10 different sliding windows with this approach? Can you please let me know if this is the right approach?
package com.spark.play;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.StreamingContext;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class SparkUserStreaming {
public static void main(String args[]){
System.out.println("Start User streaming App for Policy...");
SparkConf sparkConf = new SparkConf().setAppName("Policy").setMaster("local[2]").setJars(
JavaStreamingContext.jarOfClass(SparkUserStreaming.class)).setSparkHome("SPARK_HOME");
SparkContext sc = new SparkContext(sparkConf);
StreamingContext streamContex = new StreamingContext(sc, Durations.seconds(60));
JavaStreamingContext jssc = new JavaStreamingContext(streamContex);
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("user", 1);
String brokers = "vchenst:9092";
String topics = "user";
HashSet<String> topicsSet = new HashSet<>(
Arrays.asList(topics.split(",")));
HashMap<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", brokers);
JavaPairInputDStream<String, String> dstream =
KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet
);
JavaPairDStream<String, String> mesages = dstream.window(Durations.seconds(300), Durations.seconds(60));
JavaDStream<User> streamData = mesages.map(new Function<Tuple2<String, String>, User>() {
public User call(Tuple2<String, String> tuple2) throws Exception {
User user = new User();
//System.out.println("Line ::" + tuple2._1());
String[] dataArray = tuple2._2().split(",");
//user:arul,product:mobile,country:india,state:tn,price:1000,txid:12300086,datetime:2009-01-16 16:47:08
user.setName(dataArray[0]);
user.setProduct(dataArray[1]);
user.setCountry(dataArray[2]);
user.setState(dataArray[3]);
user.setPrice(Double.parseDouble(dataArray[4]));
user.setTxId(dataArray[5]);
user.setDateTime(dataArray[6]);
return user;
}
});
final SQLContext sqlContext = new SQLContext(sc);
streamData.foreachRDD(new Function<JavaRDD<User>, Void>() {
public Void call(JavaRDD<User> rdd) {
DataFrame userDf = sqlContext.createDataFrame(rdd, User.class);
//userDf.show();
//Rules applied for Five Minutes
//Rule 1 - Print Users making more than 2 transactions
userDf.groupBy("txId").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show();
//Rule 2 - Print User making tx with cost of more than 1000
userDf.filter(userDf.col("price").gt(1000)).show();
//Rule 3 - Print State in which max transaction taken place in Last 5 mins
userDf.groupBy("state").count();
return null;
}
});
jssc.start();
jssc.awaitTermination();
}
}
Try broadcasting the rules to every Spark executor and evaluating the data against those rules.
This might help; if it is still not clear, let me know and I'll explain the architecture.
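A minimal sketch of that idea, reusing the User bean from the question (the Rule class and its matches() predicate are made up for illustration):
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaDStream;

// Hypothetical rule holder: in a real system the analyst-configured rules would be
// loaded from a store; matches() stands in for the actual predicate of each rule.
class Rule implements Serializable {
    final String name;
    Rule(String name) { this.name = name; }
    boolean matches(User user) {
        return user.getPrice() > 1000; // placeholder check against the question's User bean
    }
}

public class RuleBroadcastExample {
    public static void evaluate(JavaSparkContext sc, JavaDStream<User> streamData) {
        // Ship all rules once to every executor instead of creating one stream per rule.
        List<Rule> rules = Arrays.asList(new Rule("high-price-tx"), new Rule("another-rule"));
        final Broadcast<List<Rule>> broadcastRules = sc.broadcast(rules);

        streamData.foreachRDD((Function<JavaRDD<User>, Void>) rdd -> {
            rdd.foreach(user -> {
                for (Rule rule : broadcastRules.value()) {
                    if (rule.matches(user)) {
                        System.out.println("Rule " + rule.name + " fired for tx " + user.getTxId());
                    }
                }
            });
            return null; // old Function<JavaRDD<User>, Void> API, as in the question
        });
    }
}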
