I have a Spark application that uses TwitterUtils to read a Twitter stream and applies a map and a foreachRDD to the stream to put Twitter messages into a database. That all works great.
My question: what is the most appropriate way to detach from the Twitter stream once everything is running? Suppose I want to collect only 1000 messages, or run the collection for 60 seconds.
The code is as follows:
SparkConf sparkConf = new SparkConf().setAppName("Java spark twitter stream");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000));
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, filters);
JavaDStream<String> statuses = tweets.map(
new Function<Status, String>() {
public String call(Status status) {
//combine the strings here.
GeoLocation geoLocation = status.getGeoLocation();
if (geoLocation != null) {
String text = status.getText().replaceAll("[\r\n]", " ");
String line = geoLocation.getLongitude() + ",,,,"
+ geoLocation.getLatitude() + ",,,,"
+ status.getCreatedAt().getTime()
+ ",,,," + status.getUser().getId()
+ ",,,," + text;
return line;
} else {
return null;
}
}
}
).filter(new Function<String, Boolean>() {
public Boolean call(String input) {
return input != null;
}
});
statuses.print();
statuses.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
@Override
public Void call(JavaRDD<String> rdd, Time time) {
SQLContext sqlContext
= JavaSQLContextSingleton
.getInstance(rdd.context());
sqlContext.setConf("spark.sql.tungsten.enabled", "false");
JavaRDD<Row> tweetRowRDD
= rdd.map(new TweetMapLoadFunction());
DataFrame statusesDataFrame
= sqlContext
.createDataFrame(
tweetRowRDD,
tweetSchema.createTweetStructType());
return null;
}
});
ssc.start();
ssc.awaitTermination();
This is straight from the documentation:
The processing can be manually stopped using streamingContext.stop().
Points to remember:
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
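For the two limits mentioned in the question (1000 messages or 60 seconds), one possible approach is to track a message counter and a deadline on the driver and call stop() once either is reached. The sketch below is deliberately Spark-free so that it runs on its own; the class `StopCondition` and its method names are hypothetical, and the comments mark where the Spark calls would go in the real application.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: decides when a streaming job has collected enough. */
class StopCondition {
    private final long maxMessages;
    private final long deadlineMillis;
    private final AtomicLong seen = new AtomicLong();

    StopCondition(long maxMessages, long maxRuntimeMillis, long startMillis) {
        this.maxMessages = maxMessages;
        this.deadlineMillis = startMillis + maxRuntimeMillis;
    }

    /** Call once per batch, e.g. from inside foreachRDD with rdd.count(). */
    void recordBatch(long batchSize) {
        seen.addAndGet(batchSize);
    }

    /** True once either the message limit or the time limit is reached. */
    boolean shouldStop(long nowMillis) {
        return seen.get() >= maxMessages || nowMillis >= deadlineMillis;
    }

    long messagesSeen() {
        return seen.get();
    }
}
```

In the driver, ssc.awaitTermination() could then be replaced by a loop that calls ssc.awaitTerminationOrTimeout(1000) until shouldStop(System.currentTimeMillis()) returns true, followed by ssc.stop(true, true) for a graceful shutdown. Because the counter is updated once per batch, the final count can overshoot the limit by up to one batch. If only the time limit matters, a single ssc.awaitTerminationOrTimeout(60000) followed by stop() should be enough.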
Related
I am trying to create an Apache Spark job that consumes Kafka messages submitted to a topic. I submit messages to the topic using kafka-console-producer as below.
./kafka-console-producer.sh --broker-list kafka1:9092 --topic my-own-topic
To read messages I am using the spark-streaming-kafka-0-10_2.11 library. With the library I manage to read the total count of messages received on the topic, but I cannot read the ConsumerRecord objects in the stream: when I try to read one, the entire application blocks and nothing is printed to the console. Note that I am running Kafka, Zookeeper and Spark in Docker containers. Help would be greatly appreciated.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;
public class SparkKafkaStreamingJDBCExample {
public static void main(String[] args) {
// Start a spark instance and get a context
SparkConf conf =
new SparkConf().setAppName("Study Spark").setMaster("spark://spark-master:7077");
// Setup a streaming context.
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(3));
// Create a map of Kafka params
Map<String, Object> kafkaParams = new HashMap<String, Object>();
// List of Kafka brokers to listen to.
kafkaParams.put("bootstrap.servers", "kafka1:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
// Do you want to start from the earliest record or the latest?
kafkaParams.put("auto.offset.reset", "earliest");
kafkaParams.put("enable.auto.commit", true);
// List of topics to listen to.
Collection<String> topics = Arrays.asList("my-own-topic");
// Create a Spark DStream with the kafka topics.
final JavaInputDStream<ConsumerRecord<String, String>> stream =
KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
System.out.println("Study Spark Example Starting ....");
stream.foreachRDD(rdd -> {
if (rdd.isEmpty()) {
System.out.println("RDD Empty " + rdd.count());
return;
} else {
System.out.println("RDD not empty " + rdd.count());
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
System.out.println("Partition Id " + TaskContext.getPartitionId());
OffsetRange o = offsetRanges[TaskContext.getPartitionId()];
System.out.println("Topic " + o.topic());
System.out.println("Creating RDD !!!");
JavaRDD<ConsumerRecord<String, String>> r =
KafkaUtils.createRDD(streamingContext.sparkContext(), kafkaParams, offsetRanges,
LocationStrategies.PreferConsistent());
System.out.println("Count " + r.count());
//Application stuck from here onwards ...
ConsumerRecord<String, String> first = r.first();
System.out.println("First taken");
System.out.println("First value " + first.value());
}
});
System.out.println("Stream context starting ...");
// Start streaming.
streamingContext.start();
System.out.println("Stream context started ...");
try {
System.out.println("Stream context await termination ...");
streamingContext.awaitTermination();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
Sample output is given below.
Study Spark Example Starting ....
Stream context starting ...
Stream context started ...
Stream context await termination ...
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD Empty 0
RDD not empty 3
Partition Id 0
Topic my-own-topic
Creating RDD !!!
I am facing a strange issue, and I have tried using a custom receiver as well.
Issue: the Spark driver/executors stop receiving data and displaying it in stdout after 5 minutes of activity. It keeps working if we keep writing data to the server socket at the other end.
No error is reported in the driver or executor logs.
Code snippet
SparkConf sparkConf = new SparkConf().setMaster("spark://10.0.0.5:7077").setAppName("SmartAudioAnalytics")
.set("spark.executor.memory", "1g").set("spark.cores.max", "5").set("spark.driver.cores", "2")
.set("spark.driver.memory", "2g");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(3000));
JavaDStream<String> JsonReq1 = ssc.socketTextStream("myip", 9997, StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> JsonReq2 = ssc.socketTextStream("myIP", 9997, StorageLevels.MEMORY_AND_DISK_SER);
ArrayList<JavaDStream<String>> streamList = new ArrayList<JavaDStream<String>>();
streamList.add(JsonReq1);
JavaDStream<String> UnionStream = ssc.union(JsonReq2, streamList);
UnionStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
int total = 0;
@Override
public void call(JavaRDD<String> rdd) throws Exception {
long count = rdd.count();
total += count;
System.out.println(total);
rdd.foreach(new VoidFunction<String>() {
private static final long serialVersionUID = 1L;
@Override
public void call(String s) throws Exception {
System.out.println(s);
}
});
}
});
System.out.println(UnionStream.count());
ssc.start();
ssc.awaitTermination();
I have opened the Spark UI and found that all threads are working properly even after 24 hours. Please see the pictures.
We receive live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and this is what we want to achieve:
Group the data by DeviceId. This is done, but we are not sure if we can get a DataSet.
Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to DataSets, and applied the aggregation logic on each DataSet using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing happened on the master node; after a while CPU usage spiked to 100% and the process crashed.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far. It does run on the worker nodes, but we have yet to get a DataSet for the aggregation logic.
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
@Override
public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
@Override
public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
while(data.hasNext()){
LOGGER.error("Machine Data == >>"+data.next());
}
}
});
}
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e)
{
e.printStackTrace();
}
The grouping code below gives us an Iterable of strings for each device; ideally we would like to get a DataSet.
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is to loop using foreachPartition so that the executing code gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change strategy and not try to build a DataSet within the foreachPartition loop.
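Without a SparkSession on the executors, the per-device aggregation inside foreachPartition can be written against plain Java collections instead. The following is a minimal sketch under a few assumptions: the numeric field of interest (for example speed) has already been extracted from each JSON record, the aggregation is a simple average, and Map.Entry stands in for Spark's Tuple2 so that the example runs without Spark. `DeviceAggregator` and `averagePerDevice` are hypothetical names.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that could be called from within foreachPartition. */
class DeviceAggregator {
    /**
     * Consumes one partition of (deviceId, readings) pairs and returns
     * the average reading per device. Devices without readings are skipped.
     */
    static Map<String, Double> averagePerDevice(
            Iterator<Map.Entry<String, Iterable<Double>>> partition) {
        Map<String, Double> result = new LinkedHashMap<>();
        while (partition.hasNext()) {
            Map.Entry<String, Iterable<Double>> entry = partition.next();
            double sum = 0.0;
            long count = 0;
            for (double reading : entry.getValue()) {
                sum += reading;
                count++;
            }
            if (count > 0) {
                result.put(entry.getKey(), sum / count);
            }
        }
        return result;
    }
}
```

Since groupByKey places all values for a key in one partition, each device appears at most once per partition, so a plain put is enough here. For aggregations that can be combined incrementally (sums, counts, averages), reduceByKey or aggregateByKey would likely shuffle less data than groupByKey.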
I want to read values from the Spark checkpoint directory.
Does checkpointing store data only in HDFS?
I want to check whether the data actually exists in the checkpoint. I am running Spark on my local machine as a test to understand the concept.
public static JavaStreamingContext createContext(){
SparkConf sparkConf = new SparkConf().setAppName("SparkStreaming");
sparkConf.setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(20));
jssc.checkpoint("C:\\Users\\Desktop\\test");
JavaDStream<String> customReceiverStream = jssc.receiverStream(new
JavaCustomReceiver(MYSQL_DRIVER,
MYSQL_CONNECTION_URL,MYSQL_USERNAME,MYSQL_PWD));
return jssc;
}
public static void main(String[] args) throws InterruptedException {
Function0<JavaStreamingContext> createContextFunc = new Function0<JavaStreamingContext>() {
@Override
public JavaStreamingContext call() {
return createContext();
}
};
JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate("C:\\Users\\dhala\\Desktop\\test", createContextFunc);
System.out.println(streamingContext.toString());
System.out.println(streamingContext.sparkContext().getCheckpointDir());
streamingContext.start();
streamingContext.awaitTermination();
I want to read from the checkpoint directory. How do I find the actual values stored in the checkpoints?
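The checkpoint directory does not have to be on HDFS; a local path works for testing, as in the code above. To check whether anything has been written, listing the directory is usually enough, since the files themselves hold serialized Java objects that are not meant to be read by hand. Below is a small sketch using only java.nio; `CheckpointLister` is a hypothetical name, and the `checkpoint-` file prefix is an assumption about how Spark Streaming names its metadata files that may differ between versions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for inspecting a local checkpoint directory. */
class CheckpointLister {
    /** Returns the sorted names of files whose name starts with "checkpoint-". */
    static List<String> listCheckpointFiles(Path checkpointDir) {
        try (Stream<Path> entries = Files.list(checkpointDir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    // Assumed naming convention; adjust if your version differs.
                    .filter(name -> name.startsWith("checkpoint-"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An empty result after a few batch intervals would suggest that no metadata checkpoint has been written yet, for example because the context was never started or no output operation is registered on the stream.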
I wrote a Spark Streaming application which receives temperature values and calculates the all-time average temperature. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key). For state tracking I use the StatCounter class, which accumulates running statistics over the temperature values (as doubles) and exposes the average through the StatCounter.mean method. Here is my program:
EDITED MY WHOLE CODE: NOW USING StatCounter
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(1));
streamingContext.checkpoint("hdfs://server:8020/spark-history/checkpointing");
JavaReceiverInputDStream<String> ingoingStream = streamingContext.socketTextStream(serverIp, 11833);
JavaDStream<SensorData> sensorDStream = ingoingStream.map(new Function<String, SensorData>() {
public SensorData call(String json) throws Exception {
ObjectMapper om = new ObjectMapper();
return (SensorData)om.readValue(json, SensorData.class);
}
});
JavaPairDStream<String, Float> temperatureDStream = sensorDStream.mapToPair(new PairFunction<SensorData, String, Float>() {
public Tuple2<String, Float> call(SensorData sensorData) throws Exception {
return new Tuple2<String, Float>(sensorData.getIdSensor(), sensorData.getValTemp());
}
});
JavaPairDStream<String, StatCounter> statCounterDStream = temperatureDStream.updateStateByKey(new Function2<List<Float>, Optional<StatCounter>, Optional<StatCounter>>() {
public Optional<StatCounter> call(List<Float> newTemperatures, Optional<StatCounter> statsYet) throws Exception {
StatCounter stats = statsYet.or(new StatCounter());
for(float temp : newTemperatures) {
stats.merge(temp);
}
return Optional.of(stats);
}
});
JavaPairDStream<String, Double> avgTemperatureDStream = statCounterDStream.mapToPair(new PairFunction<Tuple2<String,StatCounter>, String, Double>() {
public Tuple2<String, Double> call(Tuple2<String, StatCounter> statCounterTuple) throws Exception {
String key = statCounterTuple._1();
double avgValue = statCounterTuple._2().mean();
return new Tuple2<String, Double>(key, avgValue);
}
});
avgTemperatureDStream.print();
This seems to work fine. But now to the question:
I just found an example online which also shows how to calculate an all-time average: https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/total.html
They use AtomicLongs etc. for storing the "stateful values" and update them in a foreachRDD method.
My question now is: What is the better solution for a stateful calculation of all time in Spark Streaming? Are there any advantages / disadvantages of using one or the other way? Thank you!
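Both approaches boil down to keeping a running count and sum and merging each new batch into it. The difference is where that state lives: with updateStateByKey it is kept per key by Spark and covered by checkpointing, whereas counters updated in foreachRDD live in the driver JVM. The sketch below shows the merge step in the shape of an updateStateByKey function, using only the JDK; `RunningMean` is a hypothetical stand-in for StatCounter, and java.util.Optional stands in for the Optional type used in the code above.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical minimal running-average state (count and sum only). */
class RunningMean {
    private long count;
    private double sum;

    /** Folds one value into the state and returns this for chaining. */
    RunningMean merge(double value) {
        count++;
        sum += value;
        return this;
    }

    /** Average of all merged values, or NaN if nothing was merged yet. */
    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    long count() {
        return count;
    }

    /** Same shape as an updateStateByKey function: new values plus previous state. */
    static Optional<RunningMean> update(List<Float> newValues, Optional<RunningMean> previous) {
        RunningMean state = previous.orElseGet(RunningMean::new);
        for (float v : newValues) {
            state.merge(v);
        }
        return Optional.of(state);
    }
}
```

The state stays constant in size no matter how many values have been merged, which is what makes an all-time average feasible in either approach.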