Spark Streaming socket stream example not working - apache-spark

I'm trying to use Spark Streaming but I'm stuck at the first example:
import java.util.Arrays;
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

public class NetworkWordCount {
    public static void main(String[] args) {
        // Create a local StreamingContext with two working threads and batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(
            new FlatMapFunction<String, String>() {
                @Override public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });
        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                @Override public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            });
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                @Override public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });
        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();
        jssc.start();            // Start the computation
        jssc.awaitTermination();
    }
}
This piece of code is literally a copy-paste from the documentation:
https://spark.apache.org/docs/1.6.3/streaming-programming-guide.html
I set up a netcat server like this:
nc -lk 9999
and one nc client like this:
nc localhost 9999
in which I type sentences such as:
Hello world !
Hellow world ! \n
That appears properly on the netcat server.
And yet it's not working: at every batch I just get an empty print.
21/02/16 00:36:41 INFO SocketInputDStream: Removing blocks of RDD BlockRDD[137] at socketTextStream at NetworkWordCount.java:17 of time 1613432201000 ms
21/02/16 00:36:41 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1613432199000 ms)
21/02/16 00:36:41 INFO InputInfoTracker: remove old batch metadata: 1613432199000 ms
-------------------------------------------
Time: 1613432201000 ms
-------------------------------------------
I tried different values for setMaster, such as local[4], local[2], and local[*], but the result is the same.
Also, if I run the Spark Streaming code before starting my netcat client, I can't even see the strings on the nc server.

I found the solution to my problem.
In short, you need to type the messages directly into the terminal running the TCP server; you don't need another netcat client.
There is no bug or missing configuration; it was just a misunderstanding of how netcat works.
I understood from man nc that the -k option allows netcat to manage multiple connections, but that was partly wrong:
-k When a connection is completed, listen for another one. Requires -l. When used together with the -u option, the server socket is not connected and it can receive UDP datagrams from multiple
hosts.
But it doesn't mean what I thought at all. If you set up a netcat server with the -k option, it will accept multiple connections but will still process them one at a time.
This means that if you have two nc clients and you type some text into both of them, only the text from one of them will be received by the server until you close that first connection.
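For local testing you can also skip netcat entirely. Below is a minimal, hypothetical Java alternative (not part of the original answer; the class name LineServer is made up): a tiny server that accepts the single connection opened by socketTextStream("localhost", 9999) and forwards whatever you type into its terminal, one line per element of the DStream. Start it before the Spark job.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class LineServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9999);
             Socket client = server.accept();   // the Spark socket receiver connects here
             PrintWriter out = new PrintWriter(client.getOutputStream(), true);
             BufferedReader console = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = console.readLine()) != null) {
                out.println(line);   // each line typed here becomes one element of the stream
            }
        }
    }
}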

Related

In apache spark how we can remove data from file once spark reads it

I am using a custom Spark receiver for Spark Streaming. In the receiver, I read text from a file and make an RDD out of it.
The problem is that the data remains in the file and the custom receiver reads it again. So I want to remove the data from the file once Spark has read it, to avoid duplication.
The receiver function looks like this:
private void receive() {
    try {
        List<String> blocks = new ArrayList<>();
        while (!isStopped()) {
            JavaSparkContext sc = spark.getSparkContext();
            JavaRDD<String> lines = sc.textFile("src/dummy2.csv");
            blocks = lines.collect();
            store(blocks.iterator());
            blocks.clear();
        }
        // Restart in an attempt to connect again when server is active again
        restart("Trying to connect again");
    } catch (Throwable t) {
        // restart if there is any other error
        restart("Error receiving data", t);
    }
}
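As a hedged sketch of one possible approach (not from the original post, assuming nothing else writes to the file while it is being drained; the class name FileDrainExample is made up and the read-then-truncate is not atomic): read the file's current contents, hand them to store(), then truncate the file so the next loop iteration does not see the same lines again.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class FileDrainExample {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("src/dummy2.csv");   // same file as in the question
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        // In the custom receiver this is where store(lines.iterator()) would be called.
        System.out.println("Read " + lines.size() + " lines");
        // Truncate the file so the same data is not read again on the next iteration.
        Files.write(path, new byte[0]);
    }
}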

Apache Spark -- Data Grouping and Execution in worker nodes

We are getting live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and below is what we want to achieve:
Group the data by deviceId. This is done, but we are not sure whether we can get a Dataset out of it.
We want to loop through the above grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed within the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code was collecting the data, looping through the RDDs, converting them to a Dataset, and applying the aggregation logic on the Dataset using the Spark SqlContext APIs.
When doing load testing we saw that 90% of the processing was happening on the master node, and after a while the CPU usage spiked to 100% and the process bombed out.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible in the worker nodes.
Below is the code so far; it actually works in the worker nodes, but we are yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
    try {
        mconf = new SparkConf();
        mconf.setAppName("OnPrem");
        mconf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(mconf);
        jssc = new JavaStreamingContext(sc, Durations.seconds(60));
        SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
        //spksess.sparkContext().setLogLevel("ERROR");
        Map<String, String> rabbitMqConParams = new HashMap<String, String>();
        rabbitMqConParams.put("hosts", "localhost");
        rabbitMqConParams.put("userName", "guest");
        rabbitMqConParams.put("password", "guest");
        rabbitMqConParams.put("vHost", "/");
        rabbitMqConParams.put("durable", "true");
        List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
        distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
        Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
            public String call(Delivery message) {
                return new String(message.getBody());
            }
        };
        JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
        JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2), Durations.seconds(60)); // every 60 seconds one RDD is created
        machineDataRDD.print();
        JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
        JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
        groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String, Iterable<String>>>() {
            @Override
            public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
                data.foreachPartition(new VoidFunction<Iterator<Tuple2<String, Iterable<String>>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
                        while (data.hasNext()) {
                            LOGGER.error("Machine Data == >>" + data.next());
                        }
                    }
                });
            }
        });
        jssc.start();
        jssc.awaitTermination();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
The grouping code below gives us an Iterable of Strings for a device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is the looping using foreachPartition, so that the code execution gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we will be changing the strategy and not try to build a Dataset within the foreachPartition loop.
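Based on that conclusion, here is a hedged sketch (assuming Spark 2.2+, reusing the machineDataRDD stream from the code above; the avg("data.speed") aggregation is only a placeholder for the real per-device logic) of converting each windowed RDD of JSON strings into a Dataset on the driver and letting the grouped aggregation itself run on the executors:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;

// Inside main(), after machineDataRDD has been defined:
machineDataRDD.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        // foreachRDD runs on the driver, so a SparkSession is available here
        SparkSession spark = SparkSession.builder().config(rdd.context().getConf()).getOrCreate();
        Dataset<Row> devices = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
        // Placeholder aggregation; the shuffle and the aggregation run on the executors
        devices.groupBy("DeviceId")
               .agg(avg("data.speed").alias("avgSpeed"))
               .show();
    }
});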

Spark Streaming Issue (Java)

I am trying to run a network word count job using spark streaming. The code is as below:
public class myclass {
    public static Iterator<String> myfunc(String s) {
        String ar[] = s.split(" ");
        List<String> l = new ArrayList<String>();
        for (String a : ar) {
            l.add(a);
        }
        return l.iterator();
    }
    public static void main(String args[]) throws Exception {
        SparkConf conf = new SparkConf().setAppName("trial");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
        JavaPairDStream<String, Integer> tup = lines.flatMap(s -> myfunc(s)).mapToPair(s -> new Tuple2<String, Integer>(s, 1)).reduceByKey((a, b) -> a + b);
        tup.print();
        jssc.start();
        jssc.awaitTermination();
        jssc.stop();
        jssc.close();
    }
}
I ran this job using the command:
spark-submit --class myclass --master local <myjar.jar>
In another terminal I ran a local server as:
nc -lk 7777
The job runs fine, but it just keeps going on and on.
It keeps on adding a new job every millisecond. I tried adding lines on the local server; it shows no error whatsoever, but still the job goes on and on and no output is shown.
I went through some online tutorials where, with the same code, the result is printed regularly, like below:
time 14500000ms
__________________
<output>
But that doesn't happen in my case; it keeps on adding jobs like below:
Added jobs for time 1450000
Added jobs for time 1460000
Added jobs for time 1470000
. . . .
. . . .
What am I doing wrong?
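One likely cause, offered here as an assumption rather than something stated in the thread: spark-submit was run with --master local, which gives the application a single thread, and the socket receiver permanently occupies that thread, so the queued batch jobs never get processed. The Spark Streaming guide asks for at least local[2] when using a receiver-based source such as socketTextStream. A minimal sketch of the change:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Use at least two threads: one for the socket receiver, one for processing the batches.
SparkConf conf = new SparkConf().setAppName("trial").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Equivalently, leave setMaster out of the code and run: spark-submit --master local[2] ...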

What is the difference between Spark Client and Spark Driver?

What is the difference between a Spark Client and a Spark Driver? I keep hearing that the Spark Client is the one that does spark-submit, but I never had to do that; all I do is install the Spark libraries and initialize a Spark context under public static void main(String[] args), like the code below:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        // create Spark context with Spark configuration
        JavaSparkContext sc = new JavaSparkContext(new SparkConf()
                .setAppName("Spark Count")
                .setMaster(master_url)
                .setJars(JavaSparkContext.jarOfClass(JavaWordCount.class)));
        // get threshold
        final int threshold = Integer.parseInt(args[1]);
        // read in text file and split each document into words
        JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
            new FlatMapFunction<String, String>() {
                public Iterable<String> call(String s) {
                    return Arrays.asList(s.split(" "));
                }
            }
        );
        // count the occurrence of each word
        JavaPairRDD<String, Integer> counts = tokenized.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );
        // filter out words with fewer than threshold occurrences
        JavaPairRDD<String, Integer> filtered = counts.filter(
            new Function<Tuple2<String, Integer>, Boolean>() {
                public Boolean call(Tuple2<String, Integer> tup) {
                    return tup._2 >= threshold;
                }
            }
        );
        // count characters
        JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
            new FlatMapFunction<Tuple2<String, Integer>, Character>() {
                @Override
                public Iterable<Character> call(Tuple2<String, Integer> s) {
                    Collection<Character> chars = new ArrayList<Character>(s._1().length());
                    for (char c : s._1().toCharArray()) {
                        chars.add(c);
                    }
                    return chars;
                }
            }
        ).mapToPair(
            new PairFunction<Character, Character, Integer>() {
                @Override
                public Tuple2<Character, Integer> call(Character c) {
                    return new Tuple2<Character, Integer>(c, 1);
                }
            }
        ).reduceByKey(
            new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            }
        );
        System.out.println(charCounts.collect());
    }
}
In this code I don't see any spark-submit, so which is the client and which is the driver here? Also, I don't really understand what is happening with SparkConf.setJars. Why is that needed and what does it do?
The Master is the one that creates the tasks and submits them to the Workers, and it coordinates between the different job stages, whereas the Driver sits one step above the Master: it is involved in creating the context and building the RDD graph, which it then submits to the Master.
The Spark Driver is the program that declares transformations and actions on RDDs of data and submits such requests to the Master. It basically creates your SparkContext.
Hence, the Driver prepares the context and "declares" operations on the data using RDD transformations and actions. The Driver then submits this serialized RDD graph to the Master.
The Master then creates tasks out of it and submits them to the Workers for execution. It also coordinates the different job stages.
By Spark Client I'm assuming you're talking about Spark Master, which is the server responsible for scheduling Spark jobs.
The code you have works with the Spark API, but you have to run it, as you run any program. In order to run jobs in Spark, you have to schedule them with the relevant resource manager (Standalone, YARN or Mesos) to actually execute the job in the cluster environment.
After you assemble your JAR, you go to the Spark Master server and schedule the job using the spark-submit script. As the documentation states:
Once a user application is bundled, it can be launched using the
bin/spark-submit script. This script takes care of setting up the
classpath with Spark and its dependencies, and can support different
cluster managers and deploy modes that Spark supports:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces wrap “key=value” in quotes (as shown).
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any
Also I don't really understand what is happening with SparkContext.setJars? Why is that needed and what does it do?
I assume you mean SparkConf.setJars. What that setting helps do is distribute JARs which aren't packaged into an uber JAR to the worker nodes, if needed. This is identical to passing the --jars flag to spark-submit.
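To make that concrete, here is a small hypothetical illustration (the JAR path and jar name are placeholders) of the two equivalent ways to ship an extra, non-bundled dependency to the executors:
import org.apache.spark.SparkConf;

// Programmatically, when building the configuration:
SparkConf conf = new SparkConf()
        .setAppName("Spark Count")
        .setJars(new String[] { "/path/to/extra-dependency.jar" });   // placeholder path

// Or equivalently on the command line:
// spark-submit --class JavaWordCount --jars /path/to/extra-dependency.jar <application-jar> <input-path> <threshold>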

Spark does not distribute work

I set up two virtual machines to test Spark in a distributed setup. It seems that my jobs are only run locally on one node, the one I use to submit the job.
One node runs as a datanode/worker node, and the second one additionally runs as the namenode/secondary namenode.
I configured the underlying Hadoop to use YARN.
The jps command confirms that the various services are started correctly and are basically available after I execute the start*-scripts in hadoop/spark.
I use htop to "track" whether the other node is used, but the CPU usage jumps between 2 and 3% --> probably not used. I wonder what I am missing here.
I start my job with this command:
./spark-submit --class com.... DistributedTest --master yarn-client myJar.jar
This is the class I am executing (the data.txt file is about 1 GB of pure text):
public class DistributedTest
{
    public static void main(String[] args)
        throws IOException
    {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt"); // .persist(StorageLevel.DISK_ONLY());
        long numAs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("a");
            }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>()
        {
            public Boolean call(String s)
            {
                return s.contains("b");
            }
        }).count();
        sc.close();
        String s = "Lines with a: " + numAs + ", lines with b: " + numBs;
        System.out.println(s);
    }
}
Does anyone have any ideas why my setup does not distribute the work?
The filter operation is definitely distributed, and count is partially computed on a worker, while the total count is calculated back on the master. The result of the count is also on the master.
Filtering one GB of data isn't really going to stress Spark anyway, so you should only see a short CPU spike on the worker. Rather take a look at I/O usage.
Your app is fine, there must be something wrong with your setup.
First, go through your Spark UI and make sure you have multiple workers; it also depends on how many partitions you have in your RDD.
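As a hedged illustration of that last point (reusing the textFile call from the question; the minPartitions value of 8 is arbitrary), you can check how many partitions the input has and ask for more, so the filter has enough tasks to spread across both workers:
// Ask for more input partitions so the work can be split across the workers.
JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt", 8);   // minPartitions = 8 (illustrative)
System.out.println("Partitions: " + logData.getNumPartitions());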
