Inserting JSON into a Cassandra table using Spark Streaming

I am using Spark Streaming to pull some data from Kafka and store it in Cassandra. The data coming from Kafka is in JSON format and looks like this:
{"message":"testing from kafka appender with exception","loggerName":"com...KafkaAppenderTest","params":null,"complete":"fake exception"}
Here is my code to create a stream from the Kafka messages:
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(jssc, String.class, String.class,
StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);
Map the stream to pull out the JSON message and parse it into a LogEvent object:
messages.map(new Function<Tuple2<String, String>, LogEvent>() {
@Override
public LogEvent call(Tuple2<String, String> v1) throws Exception {
Map<String, Object> map = mapper.readValue(v1._2, new TypeReference<Map<String, Object>>() {
});
return new LogEvent(map);
}
})
Finally, write each RDD to Cassandra:
.foreach(new Function2<JavaRDD<LogEvent>, Time, Void>() {
@Override
public Void call(JavaRDD<LogEvent> rdd, Time v2) throws Exception {
javaFunctions(rdd).writerBuilder("myks", "logs", mapToRow(LogEvent.class)).saveToCassandra();
return null;
}
})
This works fine, but I would like to avoid converting the JSON string to a LogEvent object. Instead, I want to pass the JSON string straight to Cassandra and use its JSON parsing functionality to insert the data into the table directly from the JSON. That way I don't have to know what is in the JSON; as long as the column names match, the data will/should get mapped to the table. Is there a way to do that?
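One way to do this, sketched below under a few assumptions, is to bypass the connector's writerBuilder and issue Cassandra's INSERT ... JSON statement yourself with the DataStax Java driver inside foreachRDD/foreachPartition. INSERT ... JSON needs Cassandra 2.2+, and foreachRDD(VoidFunction<...>) needs a reasonably recent Spark (with older APIs the Function2 form above works the same way). The contact point and the per-partition connection handling are illustrative only; a real job would reuse sessions (for example via the connector's CassandraConnector) rather than open one per partition.
import java.util.Iterator;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

messages.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
    @Override
    public void call(JavaPairRDD<String, String> rdd) throws Exception {
        rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
            @Override
            public void call(Iterator<Tuple2<String, String>> records) throws Exception {
                // Open a driver session per partition (illustrative; pool/reuse in production).
                try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                     Session session = cluster.connect()) {
                    // Cassandra parses the JSON itself and maps fields to matching column names.
                    PreparedStatement insert = session.prepare("INSERT INTO myks.logs JSON ?");
                    while (records.hasNext()) {
                        session.execute(insert.bind(records.next()._2()));
                    }
                }
            }
        });
    }
});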

Related

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and then this is what we want to achieve:
Group the data by DeviceId. This is done, but we are not sure whether we can get a Dataset out of it.
Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to a Dataset and applied the aggregation logic on the Dataset using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process bombed out.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far. It actually runs on the worker nodes, but we have yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
@Override
public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
@Override
public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
while(data.hasNext()){
LOGGER.error("Machine Data == >>"+data.next());
}
}
});
}
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e) {
e.printStackTrace();
}
}
The grouping code below gives us an Iterable of Strings for a device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping with foreachPartition, so that the code execution gets pushed down to the worker nodes.
After looking through more code samples and guidelines, it appears that SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will change strategy and not try to build a Dataset inside the foreachPartition loop.
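If the Dataset requirement is dropped, the per-device aggregation can be done with plain Java inside foreachPartition, since Datasets can only be created on the driver. The sketch below is illustrative only: it assumes the payload shape of the sample JSON above, parses it with Jackson (the question's getMap presumably does something similar), and computes a hypothetical average of the Rate field per device.
import java.util.Iterator;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.api.java.function.VoidFunction;
import com.fasterxml.jackson.databind.ObjectMapper;

// Inside the existing groupedData.foreachRDD(...) call:
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
    @Override
    public void call(Iterator<Tuple2<String, Iterable<String>>> partition) throws Exception {
        ObjectMapper mapper = new ObjectMapper(); // plain Jackson, no SparkSession needed on the worker
        while (partition.hasNext()) {
            Tuple2<String, Iterable<String>> deviceGroup = partition.next();
            double rateSum = 0;
            long count = 0;
            for (String json : deviceGroup._2()) {
                Map<?, ?> event = mapper.readValue(json, Map.class);
                Map<?, ?> payload = (Map<?, ?>) event.get("data");
                rateSum += ((Number) payload.get("Rate")).doubleValue();
                count++;
            }
            LOGGER.error("Device " + deviceGroup._1() + " avg Rate = " + (count == 0 ? 0 : rateSum / count));
        }
    }
});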

Java Spark Streaming to Cassandra

Goal: read from Kafka with Spark Streaming and store the data in Cassandra.
By: Java Spark Cassandra connector 1.6
Data input: a simple JSON line object {"id":"1","field1":"value1"}
I have a Java class that reads from Kafka via Spark Streaming, processes the data and then stores it in Cassandra.
Here is the main code:
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(ssc,
targetKafkaServerPort, targetTopic, topicMap);
JavaDStream<List<Object>> list = messages.map(new Function<Tuple2<String, String>, List<Object>>() {
public List<Object> call(Tuple2<String, String> tuple2) {
List<Object> list = new ArrayList<Object>();
Gson gson = new Gson();
MyClass myclass = gson.fromJson(tuple2._2(), MyClass.class);
myclass.setNewData("new_data");
String jsonInString = gson.toJson(myclass);
list.add(jsonInString);
return list;
}
});
The following code is incorrect:
javaFunctions(list)
.writerBuilder("schema", "table", mapToRow(JavaDStream.class))
.saveToCassandra();
because the javaFunctions method expects a JavaRDD object and "list" is a JavaDStream...
I'd need to convert the JavaDStream to a JavaRDD but I can't find the right way...
Any help?
Use
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.*;
instead of
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
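For illustration only, a minimal sketch of what that looks like, assuming the connector 1.6 japi and the MyObject class that appears in the follow-up snippet below; the streaming variant of javaFunctions accepts the DStream directly, so no DStream-to-RDD conversion is needed:
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

// Map each Kafka message to a bean and let the connector write the whole DStream.
JavaDStream<MyObject> objects = messages.map(new Function<Tuple2<String, String>, MyObject>() {
    public MyObject call(Tuple2<String, String> tuple2) {
        return new Gson().fromJson(tuple2._2(), MyObject.class);
    }
});
javaFunctions(objects)
    .writerBuilder("schema", "table", mapToRow(MyObject.class))
    .saveToCassandra();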
Ummm, not really... What I've done is use a foreachRDD after creating the DStream:
dStream.foreachRDD(new Function<JavaRDD<MyObject>, Void>() {
@Override
public Void call(JavaRDD<MyObject> rdd) throws Exception {
if (rdd != null) {
javaFunctions(rdd)
.writerBuilder("schema", "table", mapToRow(MyObject.class))
.saveToCassandra();
logging(" --> Saved data to cassandra",1,null);
}
return null;
}
});
Hope it's useful...

Refresh RDD LKP Table Via Spark Streaming

I have a Spark Streaming application that pulls data from a Kafka topic and then correlates that data with a dimensional lookup in Spark. While I have managed to load the dimensional lookup table from Hive into an RDD on the first run, I want this dimensional lookup RDD to be refreshed every hour.
From my understanding, Spark Streaming is effectively a Spark scheduler, so I am wondering whether it is possible to create a JavaDStream from the dimensional lookup RDD and have it refresh on a scheduled interval using Spark Streaming. The problem is that I have no idea how to approach this; from my understanding an RDD is immutable, so is it even possible to refresh a JavaDStream in Spark and join it with a JavaDStream that is running on a different schedule?
Current Code:
System.out.println("Loading from Hive Tables...");
//Retrieve Dimensional LKP data from hive table into RDD (CombineIVAPP function retrieves the data from Hive and performs initial joins)
final JavaPairRDD<String, Tuple2<modelService,modelONT>> LKP_IVAPP_DIM = CombineIVAPP();
LKP_IVAPP_DIM.cache();
System.out.println("Mapped Tables to K/V Pairs");
//Kafka Topic settings
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
//Begin to stream from Kafka Topic
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, ZOOKEEPER_URL, KAFKA_GROUPID, topicMap);
//Map messages from Kafka Stream to Tuple
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
//Map kafka JSON string to K/V RDD
JavaPairDStream<String, modelAlarms> RDD_ALARMS = json.mapToPair(new KafkaToRDD());
//Remove the null values
JavaPairDStream<String, modelAlarms> RDD_ALARMS_FILTERED = RDD_ALARMS.filter(new Function<Tuple2<String, modelAlarms>, Boolean>() {
@Override
public Boolean call(Tuple2<String, modelAlarms> item) {
return item != null;
}
});
//Join Alarm data from Kafka topic with hive lkp table LKP_IVAPP_DIM
JavaPairDStream<String, Tuple2<modelAlarms,Tuple2<modelService,modelONT>>> RDD_ALARMS_JOINED = RDD_ALARMS_FILTERED.transformToPair(new Function<JavaPairRDD<String, modelAlarms>, JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService,modelONT>>>>() {
@Override
public JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService,modelONT>>> call(JavaPairRDD<String, modelAlarms> v1) throws Exception {
return v1.join(LKP_IVAPP_DIM);
}
});
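One common pattern for the hourly refresh, sketched here as a rough illustration rather than a definitive answer, is to keep the lookup RDD in a mutable holder on the driver and reload it inside transformToPair, because the transform function is evaluated on the driver for every micro-batch. CombineIVAPP() and the model classes come from the code above; the AtomicReference holder, the timestamp and the one-hour interval are assumptions added for the sketch.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

final long REFRESH_INTERVAL_MS = 60 * 60 * 1000L; // refresh the lookup once per hour (illustrative)
final AtomicReference<JavaPairRDD<String, Tuple2<modelService, modelONT>>> lkpRef =
    new AtomicReference<>(LKP_IVAPP_DIM);
final AtomicLong lastRefresh = new AtomicLong(System.currentTimeMillis());

JavaPairDStream<String, Tuple2<modelAlarms, Tuple2<modelService, modelONT>>> RDD_ALARMS_JOINED =
    RDD_ALARMS_FILTERED.transformToPair(new Function<JavaPairRDD<String, modelAlarms>, JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService, modelONT>>>>() {
        @Override
        public JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService, modelONT>>> call(JavaPairRDD<String, modelAlarms> v1) throws Exception {
            // This function runs on the driver for every batch, so the lookup can be swapped here.
            if (System.currentTimeMillis() - lastRefresh.get() > REFRESH_INTERVAL_MS) {
                lkpRef.get().unpersist();           // drop the stale lookup
                lkpRef.set(CombineIVAPP().cache()); // reload the fresh dimensional data from Hive
                lastRefresh.set(System.currentTimeMillis());
            }
            return v1.join(lkpRef.get());
        }
    });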

No output after using Spark Streaming

HashMap<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
String topics = "test4";
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topics.split(" ")));
JavaDStream<String> stream1 = KafkaUtils.createDirectStream(jssc, String.class, String.class, StringDecoder.class,
StringDecoder.class, kafkaParams, topicsSet)
.transformToPair(new Function<JavaPairRDD<String, String>, JavaPairRDD<String, String>>() {
@Override
public JavaPairRDD<String, String> call(JavaPairRDD<String, String> rdd) {
rdd.saveAsTextFile("output");
return rdd;
}
}).map(new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> kv) {
return kv._2();
}
});
stream1.print();
jssc.start();
jssc.awaitTermination();
I cross-checked that there is valid data in the topic "test4".
I am expecting the strings streamed from the Kafka cluster to be printed in the console. There are no exceptions in the console, but also no output.
Is there anything I'm missing here?
Have you tried producing data into your topic after the streaming application has started?
By default the direct stream uses the configuration auto.offset.reset = largest, which means that when there is no initial offset it automatically resets to the largest offset, so you will only be able to read the new messages that enter the topic after the streaming application has started.
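If the messages already sitting in the topic should be read as well, the reset policy can be overridden in the Kafka parameters; a minimal sketch for the Kafka 0.8-style direct stream used above:
// Start from the earliest available offset instead of only new messages.
kafkaParams.put("auto.offset.reset", "smallest");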
As ccheneson says, it could be because you're missing .start() and .awaitTermination().
Or it could be because transformations in Spark are lazy, which means that you need to add an action to get the results, e.g.
stream1.print();
Or it could be because the map is being performed on the executor(s), so the output would be in the executor's log, rather than the driver's log.

How to store and read data from Spark PairRDD

A Spark PairRDD has the option to save to a file:
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<String, Integer> myPairRDD =
baseRDD.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String input) throws Exception {
// TODO Auto-generated method stub
return new Tuple2<String, Integer>(input, input.length());
}
});
myPairRDD.saveAsTextFile("path");
SparkContext's textFile reads the data back as a JavaRDD<String> only.
How can I reconstruct the PairRDD directly from the source?
Note:
A possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it,
but with huge data it takes a considerable amount of resources.
Storing this intermediate file in a non-text format is also fine.
Execution environment: JRE 1.7
You can save them as an object file if you don't mind the result file not being human-readable.
Save the file:
myPairRDD.saveAsObjectFile(path);
and then you can read pairs like this:
JavaPairRDD.fromJavaRDD(sc.objectFile(path))
EDIT:
working example:
JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD
= JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
Storing the Spark PairRDD in a sequence file works well in this scenario.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {
@Override
public Tuple2<Text, IntWritable> call(String input) throws Exception {
// TODO Auto-generated method stub
return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
}
});
myPairRDD.saveAsHadoopFile(path , Text.class, IntWritable.class,
SequenceFileOutputFormat.class);
JavaPairRDD<Text, IntWritable> newbaseRDD =
context.sequenceFile(path , Text.class, IntWritable.class);
// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
@Override
public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
System.out.println(arg0);
}
});
As suggested by user52045, the following code works with Java 8.
myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, String> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);
Example using Scala:
Reading a text file and saving it in object file format:
val ordersRDD = sc.textFile("/home/cloudera/orders.txt");
ordersRDD.count();
ordersRDD.saveAsObjectFile("orders_save_obj");
Reading the object file back and saving it in text file format:
val ordersRDD = sc.objectFile[String]("orders_save_obj");
ordersRDD.count();
ordersRDD.saveAsTextFile("orders_save_text");
