Spark Streaming - textFileStream - apache-spark

I would like to read files from a directory using textFileStream. My instruction is:
JavaDStream<String> lines = ssc.textFileStream("file:///C:/cdr/").cache();
But when I add files to that directory, my application can't read them.
My code is:
public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));

    // SPACE is assumed to be a static field such as Pattern.compile(" ")
    JavaDStream<String> lines = ssc.textFileStream("file:///C:/cdr/").cache();
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator());
    JavaPairDStream<String, Integer> wordCounts = words
            .mapToPair(s -> new Tuple2<>(s, 1))
            .reduceByKey((i1, i2) -> i1 + i2);
    wordCounts.print();

    ssc.start();
    ssc.awaitTermination();
}
Thanks for your help.
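For what it's worth, textFileStream only picks up files that appear in the monitored directory after the stream has started, and Spark expects them to show up atomically: write the file somewhere else first and then move (rename) it into the watched directory, rather than writing it there incrementally. A minimal sketch of feeding the directory that way, assuming the C:/cdr path from the question and a hypothetical staging folder:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class FeedTextFileStream {
    public static void main(String[] args) throws Exception {
        // Hypothetical staging directory: write the file completely here first
        Path staged = Paths.get("C:/staging/cdr-batch-001.txt");
        Files.write(staged, Arrays.asList("a b c", "d e f"));

        // Then move the finished file into the directory Spark is watching;
        // on the same drive this rename is effectively atomic, so Spark sees a complete file
        Path watched = Paths.get("C:/cdr/cdr-batch-001.txt");
        Files.move(staged, watched, StandardCopyOption.ATOMIC_MOVE);
    }
}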

Related

Spark saveAsTextFile overwrites file after each batch

I am currently trying to use Spark Streaming to get input from a Kafka topic and save that input to a JSON file. I have gotten as far as saving my InputDStream as a text file, but the problem is that after each batch the file gets overwritten, and it seems like I cannot do anything about it.
Is there a method or config option at all to change this?
I tried spark.files.overwrite = false, but it did not work.
My Code is:
public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("local-test").setMaster("local[*]")
            .set("spark.shuffle.service.enabled", "false")
            .set("spark.dynamicAllocation.enabled", "false")
            .set("spark.io.compression.codec", "snappy")
            .set("spark.rdd.compress", "true")
            .set("spark.executor.instances", "4")
            .set("spark.executor.memory", "6G")
            .set("spark.executor.cores", "6")
            .set("spark.cores.max", "8")
            .set("spark.driver.memory", "2g")
            .set("spark.files.overwrite", "false");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(4));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "xxxxx");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "ID2");
    List<String> topics = Arrays.asList("LEGO_MAX");

    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

    JavaDStream<String> first = stream.map(record -> record.value().toString());
    first.foreachRDD(rdd -> rdd.saveAsTextFile("C:\\Users\\A675866\\Hallo.txt"));

    ssc.start();
    try {
        ssc.awaitTermination();
    } catch (InterruptedException e) {
        System.out.println("Failed to cut connection -> Throwing Error");
        e.printStackTrace();
    }
}
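The spark.files.overwrite setting has no effect here: each call to saveAsTextFile targets a single output directory, so writing every batch to the same path means at best only the most recent batch's data survives, which matches the overwriting you describe. A common workaround is to derive a unique output path per batch from the batch time; a minimal sketch using the foreachRDD overload that also passes the batch Time (the output prefix is hypothetical):

// Write each micro-batch to its own directory named after the batch time,
// so earlier batches are never replaced
first.foreachRDD((rdd, time) -> {
    if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("C:\\output\\lego-" + time.milliseconds());
    }
});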

Spark Inactivity - Spark Driver stop reading data using TCP streaming after few minutes

I am facing a strange issue and have tried using a custom receiver as well.
Issue: the Spark driver/executors stop receiving and displaying data in stdout after about 5 minutes. It keeps working only as long as we continuously write data to the server socket at the other end.
No errors are reported in the driver or executor logs.
Code snippet
SparkConf sparkConf = new SparkConf().setMaster("spark://10.0.0.5:7077").setAppName("SmartAudioAnalytics")
        .set("spark.executor.memory", "1g").set("spark.cores.max", "5").set("spark.driver.cores", "2")
        .set("spark.driver.memory", "2g");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(3000));

JavaDStream<String> JsonReq1 = ssc.socketTextStream("myip", 9997, StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<String> JsonReq2 = ssc.socketTextStream("myIP", 9997, StorageLevels.MEMORY_AND_DISK_SER);
ArrayList<JavaDStream<String>> streamList = new ArrayList<JavaDStream<String>>();
streamList.add(JsonReq1);
JavaDStream<String> UnionStream = ssc.union(JsonReq2, streamList);

UnionStream.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;
    int total = 0;

    @Override
    public void call(JavaRDD<String> rdd) throws Exception {
        long count = rdd.count();
        total += count;
        System.out.println(total);
        rdd.foreach(new VoidFunction<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(String s) throws Exception {
                // Note: this println runs on the executors, so its output goes to executor stdout
                System.out.println(s);
            }
        });
    }
});
// Note: count() on a DStream returns another DStream, so this prints an object reference, not a number
System.out.println(UnionStream.count());
ssc.start();
ssc.awaitTermination();
I have opened the Spark UI and all threads appear to be working properly even after 24 hours.
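Two things in the snippet are worth noting regardless of the inactivity issue: DStream.count() returns a new DStream of per-batch counts rather than a number, so System.out.println(UnionStream.count()) only prints the DStream object once at setup time, and the println inside rdd.foreach runs on the executors, so its output lands in executor stdout rather than the driver console. A minimal sketch of driver-side per-batch logging, under those assumptions:

// Print the number of records in every batch on the driver console
UnionStream.count().print();

// To inspect records on the driver, pull a small sample per batch
// instead of calling rdd.foreach(println) on the executors
UnionStream.foreachRDD(rdd -> {
    for (String s : rdd.take(10)) {
        System.out.println(s);
    }
});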

Read data from Spark checkpoint directory

I want to read values from the Spark checkpoint directory.
Does checkpointing only store data in HDFS?
I want to check whether the data actually exists in the checkpoint or not. I am using my local machine to run Spark and to test and understand the concept.
public static JavaStreamingContext createContext() {
    SparkConf sparkConf = new SparkConf().setAppName("SparkStreaming");
    sparkConf.setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(20));
    // Note: this path should match the one passed to getOrCreate below
    jssc.checkpoint("C:\\Users\\Desktop\\test");
    JavaDStream<String> customReceiverStream = jssc.receiverStream(
            new JavaCustomReceiver(MYSQL_DRIVER, MYSQL_CONNECTION_URL, MYSQL_USERNAME, MYSQL_PWD));
    return jssc;
}

public static void main(String[] args) throws InterruptedException {
    Function0<JavaStreamingContext> createContextFunc = new Function0<JavaStreamingContext>() {
        @Override
        public JavaStreamingContext call() {
            return createContext();
        }
    };
    JavaStreamingContext streamingContext =
            JavaStreamingContext.getOrCreate("C:\\Users\\dhala\\Desktop\\test", createContextFunc);
    System.out.println(streamingContext.toString());
    System.out.println(streamingContext.sparkContext().getCheckpointDir());
    streamingContext.start();
    streamingContext.awaitTermination();
}
I want to read from the checkpoint dir. How do I find the actual values stored in the checkpoints?
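The checkpoint data is written wherever the configured checkpoint path points (HDFS in a cluster, the local filesystem in local mode, as here); it is a binary mix of serialized DStream metadata and RDD checkpoint files, so it is not meant to be read back as application values. To simply verify that something was written, a minimal sketch that lists the checkpoint directory through the Hadoop FileSystem API (path taken from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectCheckpointDir {
    public static void main(String[] args) throws Exception {
        // Path taken from the question; point it at whatever jssc.checkpoint(...) uses
        Path checkpointDir = new Path("C:\\Users\\dhala\\Desktop\\test");
        FileSystem fs = checkpointDir.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(checkpointDir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}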

Is it good practice to open several Kafka streams in one Spark Context?

We have several applications that follow the same logic and patterns, and we would like to know whether it is good practice to open several streams in one Spark context. The main application to submit would have something of this sort:
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("test-app");
conf.set("log4j.configuration", "\\log4j.properties");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(20));

// Iterate streams
for (RealtimeApplication app : realtimeApplications) {
    app.execute(ssc);
}

// Trigger!
ssc.start();

// Await stopping of the service...
ssc.awaitTermination();
Then, in the implementation of the abstract method execute(JavaStreamingContext ssc), you would have the following code:
JavaPairReceiverInputDStream<String, String> kafkaStream =
        KafkaUtils.createStream(ssc, this.getZkQuorum(), this.getSparkGroup(), topicsSet);
JavaDStream<String> lines = kafkaStream.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        // Extract the transaction
        String value = tuple2._2();
        // Do something here...
        String result = executeSomething(value);
        return result;
    }
});
Is this something to be considered wrong in Spark development?
I would rather share the logic by deriving several DStreams from the same stream, like:
JavaDStream<String> lines1 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
JavaDStream<String> lines2 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
JavaDStream<String> lines3 = kafkaStream.map(new Function<Tuple2<String, String>, String>() {...});
with one source stream.
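A minimal sketch of that idea, assuming the same receiver-based KafkaUtils.createStream API used in the question and hypothetical quorum, group, and topic names: the source stream is created once, each "application" derives its own DStream from it, and all of them register their own output operations under one StreamingContext.

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SharedKafkaStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("shared-stream");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(20000));

        // One shared receiver-based Kafka stream (hypothetical quorum, group, and topic)
        Map<String, Integer> topics = new HashMap<>();
        topics.put("topic-a", 1);
        JavaPairReceiverInputDStream<String, String> kafkaStream =
                KafkaUtils.createStream(ssc, "zkhost:2181", "shared-group", topics);

        // Each "application" derives its own DStream from the same source...
        JavaDStream<String> upperCased = kafkaStream.map(t -> t._2().toUpperCase());
        JavaDStream<Integer> lengths = kafkaStream.map(t -> t._2().length());

        // ...and registers its own output operation
        upperCased.print();
        lengths.print();

        ssc.start();
        ssc.awaitTermination();
    }
}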

How to store and read data from Spark PairRDD

A Spark PairRDD has the option to save itself to a file.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<String, Integer> myPairRDD =
        baseRDD.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String input) throws Exception {
                return new Tuple2<String, Integer>(input, input.length());
            }
        });
myPairRDD.saveAsTextFile("path");
SparkContext's textFile reads the data back only into a JavaRDD<String>.
How can the PairRDD be reconstructed directly from the source?
Note:
A possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it, but with huge data this takes a considerable amount of resources.
Storing this intermediate file in a non-text format is also fine.
Execution environment: JRE 1.7.
You can save them as an object file if you don't mind the result file not being human readable.
Save the file:
myPairRDD.saveAsObjectFile(path);
and then you can read the pairs back like this:
JavaPairRDD.fromJavaRDD(sc.objectFile(path))
EDIT:
Working example:
JavaRDD<String> rdd = sc.parallelize(Lists.newArrayList("1", "2"));
rdd.mapToPair(p -> new Tuple2<>(p, p)).saveAsObjectFile("c://example");
JavaPairRDD<String, String> pairRDD =
        JavaPairRDD.fromJavaRDD(sc.objectFile("c://example"));
pairRDD.collect().forEach(System.out::println);
Storing the Spark PairRDD in Sequence file works well in this scenario.
JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));
JavaPairRDD<Text, IntWritable> myPairRDD = baseRDD.mapToPair(new PairFunction<String, Text, IntWritable>() {
    @Override
    public Tuple2<Text, IntWritable> call(String input) throws Exception {
        return new Tuple2<Text, IntWritable>(new Text(input), new IntWritable(input.length()));
    }
});
myPairRDD.saveAsHadoopFile(path, Text.class, IntWritable.class,
        SequenceFileOutputFormat.class);

JavaPairRDD<Text, IntWritable> newbaseRDD =
        context.sequenceFile(path, Text.class, IntWritable.class);

// Verify the data
System.out.println(myPairRDD.collect());
newbaseRDD.foreach(new VoidFunction<Tuple2<Text, IntWritable>>() {
    @Override
    public void call(Tuple2<Text, IntWritable> arg0) throws Exception {
        System.out.println(arg0);
    }
});
As suggested by user52045, the following code works with Java 8.
myPairRDD.saveAsObjectFile(path);
JavaPairRDD<String, String> objpairRDD = JavaPairRDD.fromJavaRDD(context.objectFile(path));
objpairRDD.collect().forEach(System.out::println);
Example using Scala:
Reading a text file and saving it in object file format:
val ordersRDD = sc.textFile("/home/cloudera/orders.txt");
ordersRDD.count();
ordersRDD.saveAsObjectFile("orders_save_obj");
Reading the object file back and saving it in text file format:
val ordersObjRDD = sc.objectFile[String]("orders_save_obj");
ordersObjRDD.count();
ordersObjRDD.saveAsTextFile("orders_save_text");
