Spark Jobserver: Very large task size - apache-spark

I'm getting messages along the lines of the following in my Spark JobServer logs:
Stage 14 contains a task of very large size (9523 KB). The maximum recommended task size is 100 KB.
I'm creating my RDD with this code:
List<String> data = new ArrayList<>();
for (int i = 0; i < 2000000; i++) {
data.add(UUID.randomUUID().toString());
}
JavaRDD<String> randomData = sc.parallelize(data).cache();
I understand that the first time I run this it could be big, because the data in the RDD doesn't exist on the executor nodes yet.
I would have expected subsequent runs to be quick, though: I'm using Spark JobServer to keep the context around and reuse the RDD, so the data should already exist on the nodes.
The code is very simple:
private static Function<String, Boolean> func = new Function<String, Boolean>() {
public Boolean call(String s) {
return s.contains("a");
}
};
----
rdd.filter(func).count();

Related

Apache Spark -- Data Grouping and Execution in worker nodes

We receive live machine data as JSON from RabbitMQ. Below is a sample of the JSON:
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:35","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1001","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:36","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:37","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
{"DeviceId":"MAC-1002","DeviceType":"Sim-1","TimeStamp":"05-12-2017 10:25:38","data":{"Rate":10,"speed":2493,"Mode":1,"EMode":2,"Run":1}}
The data is windowed for a duration of 'X' minutes, and this is what we then want to achieve:
Group the data by DeviceId. This is done, but we are not sure whether we can get a Dataset.
Loop through the grouped data and execute the aggregation logic for each device using foreachPartition, so that the code is executed on the worker nodes.
Please correct me if my thought process is wrong here.
Our earlier code collected the data, looped through the RDDs, converted them to Datasets, and applied the aggregation logic on each Dataset using the Spark SQLContext APIs.
During load testing we saw that 90% of the processing was happening on the master node; after a while the CPU usage spiked to 100% and the process crashed.
So we are now trying to re-engineer the whole process to execute as much of the logic as possible on the worker nodes.
Below is the code so far, which does run on the worker nodes, but we have yet to get a Dataset for the aggregation logic:
public static void main(String[] args) {
try {
mconf = new SparkConf();
mconf.setAppName("OnPrem");
mconf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(mconf);
jssc = new JavaStreamingContext(sc, Durations.seconds(60));
SparkSession spksess = SparkSession.builder().appName("Onprem").getOrCreate();
//spksess.sparkContext().setLogLevel("ERROR");
Map<String, String> rabbitMqConParams = new HashMap<String, String>();
rabbitMqConParams.put("hosts", "localhost");
rabbitMqConParams.put("userName", "guest");
rabbitMqConParams.put("password", "guest");
rabbitMqConParams.put("vHost", "/");
rabbitMqConParams.put("durable", "true");
List<JavaRabbitMQDistributedKey> distributedKeys = new LinkedList<JavaRabbitMQDistributedKey>();
distributedKeys.add(new JavaRabbitMQDistributedKey(QUEUE_NAME, new ExchangeAndRouting(EXCHANGE_NAME, "fanout", ""), rabbitMqConParams));
Function<Delivery, String> messageHandler = new Function<Delivery, String>() {
public String call(Delivery message) {
return new String(message.getBody());
}
};
JavaInputDStream<String> messages = RabbitMQUtils.createJavaDistributedStream(jssc, String.class, distributedKeys, rabbitMqConParams, messageHandler);
JavaDStream<String> machineDataRDD = messages.window(Durations.minutes(2),Durations.seconds(60)); //every 60 seconds one RDD is Created
machineDataRDD.print();
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
groupedData.foreachRDD(new VoidFunction<JavaPairRDD<String,Iterable<String>>>(){
@Override
public void call(JavaPairRDD<String, Iterable<String>> data) throws Exception {
data.foreachPartition(new VoidFunction<Iterator<Tuple2<String,Iterable<String>>>>(){
@Override
public void call(Iterator<Tuple2<String, Iterable<String>>> data) throws Exception {
while(data.hasNext()){
LOGGER.error("Machine Data == >>"+data.next());
}
}
});
}
});
jssc.start();
jssc.awaitTermination();
}
catch (Exception e)
{
e.printStackTrace();
}
The grouping code below gives us an Iterable of strings per device; ideally we would like to get a Dataset:
JavaPairDStream<String, String> pairedData = machineDataRDD.mapToPair(s -> new Tuple2<String, String>(getMap(s).get("DeviceId").toString(), s));
JavaPairDStream<String, Iterable<String>> groupedData = pairedData.groupByKey();
The important thing for me is looping with foreachPartition so that the executing code gets pushed to the worker nodes.
After looking through more code samples and guidelines: SQLContext and SparkSession are not serializable and are not available on the worker nodes, so we are changing strategy and will not try to build a Dataset within the foreachPartition loop.
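A minimal sketch of the per-device aggregation described in this question, written in plain Java so it runs without a cluster. Everything here is an assumption for illustration: the Reading class, the averageSpeedByDevice helper, the choice of "average speed" as the aggregation, and the sample values. A list stands in for one windowed batch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class DeviceAggregation {

    // Hypothetical flattened reading; the real input is one JSON string per message.
    static final class Reading {
        private final String deviceId;
        private final int speed;

        Reading(String deviceId, int speed) {
            this.deviceId = deviceId;
            this.speed = speed;
        }

        String deviceId() { return deviceId; }
        int speed() { return speed; }
    }

    // Groups readings by device ID and averages the speed within each group.
    static Map<String, Double> averageSpeedByDevice(List<Reading> readings) {
        return readings.stream().collect(
                Collectors.groupingBy(Reading::deviceId, TreeMap::new,
                        Collectors.averagingInt(Reading::speed)));
    }

    public static void main(String[] args) {
        // Made-up values standing in for one window of machine data.
        List<Reading> window = List.of(
                new Reading("MAC-1001", 2493),
                new Reading("MAC-1001", 2495),
                new Reading("MAC-1002", 2493));
        System.out.println(averageSpeedByDevice(window));
    }
}
```

On a JavaPairDStream, a per-key fold of this shape is what reduceByKey expresses, which may avoid building an Iterable per device on a single node. Whether the real aggregation logic can be written as such a fold is an assumption.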

How to count number of items per second in Spark Streaming?

I get a JSON stream and I want to compute the number of items that have a status of "Pending" every second. How do I do that? I have the code below so far, but 1) I am not sure if it is correct, and 2) it returns a DStream, whereas my objective is to store a number every second to Cassandra or a queue; imagine there is a function public void store(Long number){}.
// #1
jsonMessagesDStream
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
JsonParser parser = new JsonParser();
JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
if (jsonObj != null && jsonObj.has("status")) {
return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
}
return false;
}
}).countByValue().foreachRDD(new VoidFunction<JavaPairRDD<String, Long>>() {
@Override
public void call(JavaPairRDD<String, Long> stringLongJavaPairRDD) throws Exception {
store(stringLongJavaPairRDD.count());
}
});
I also tried the following. It still didn't work, since it prints zero all the time; I'm not sure if it is right.
// #2
jsonMessagesDStream
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
JsonParser parser = new JsonParser();
JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
if (jsonObj != null && jsonObj.has("status")) {
return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
}
return false;
}
}).foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> stringJavaRDD) throws Exception {
store(stringJavaRDD.count());
}
});
Part of the log output:
16/09/10 17:51:39 INFO SparkContext: Starting job: count at Consumer.java:88
16/09/10 17:51:39 INFO DAGScheduler: Got job 17 (count at Consumer.java:88) with 4 output partitions
16/09/10 17:51:39 INFO DAGScheduler: Final stage: ResultStage 17 (count at Consumer.java:88)
16/09/10 17:51:39 INFO DAGScheduler: Parents of final stage: List()
16/09/10 17:51:39 INFO DAGScheduler: Missing parents: List()
16/09/10 17:51:39 INFO DAGScheduler: Submitting ResultStage 17 (MapPartitionsRDD[35] at filter at Consumer.java:72), which has no missing parents
BAR gets printed but not FOO:
//Debug code
jsonMessagesDStream
.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String v1) throws Exception {
System.out.println("****************FOO******************");
JsonParser parser = new JsonParser();
JsonObject jsonObj = parser.parse(v1).getAsJsonObject();
if (jsonObj != null && jsonObj.has("status")) {
return jsonObj.get("status").getAsString().equalsIgnoreCase("Pending");
}
return false;
}
}).foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> stringJavaRDD) throws Exception {
System.out.println("*****************BAR******************");
store(stringJavaRDD.count());
}
});
Since you have already filtered the result-set, you could just do a count() on the DStream/RDD.
Also, I don't think you need windowing here if you are reading from the source every second. Windowing is needed when the micro-batch interval doesn't match the aggregation frequency. Are you looking at a micro-batch interval of less than a second?
It returns me a Dstream but my objective is to store a number every second to cassandra or queue
The way Spark works is that it gives you a new DStream every time you do a computation on an existing DStream, so you can easily chain functions together. You should also be aware of the distinction between transformations and actions in Spark. On a DStream, functions like filter() and count() are transformations, in the sense that they operate on a DStream and give a new DStream. But if you need side effects (like printing, pushing to a DB, etc.), you should be looking at Spark actions, which on a DStream are called output operations (foreachRDD() is one of them).
If you need to push a DStream to Cassandra, look at the Cassandra connectors, which expose functions (actions in Spark terminology) that you can use to push data into Cassandra.
You can also use a sliding window of 1 second along with the reduceByKey function; note that the window and slide durations must be multiples of the batch interval. Once you choose a 1-second slide interval, the store call is triggered every second.

Refresh RDD LKP Table Via Spark Streaming

I have a Spark Streaming application that pulls data from a Kafka topic and then correlates that data with a dimensional lookup in Spark. I have managed to load the dimensional lookup table from Hive into an RDD on the first run, but I want this dimensional lookup RDD to be refreshed every hour.
From my understanding, Spark Streaming is effectively a Spark scheduler, so I am wondering if it's possible to create a JavaDStream of the dimensional lookup RDD and have it refresh on a scheduled interval using Spark Streaming. The problem is that I have no idea how to approach this. From my understanding an RDD is immutable, so is it even possible to refresh a JavaDStream in Spark and join it with a JavaDStream that is running on a different schedule?
Current Code:
System.out.println("Loading from Hive Tables...");
//Retrieve Dimensional LKP data from hive table into RDD (CombineIVAPP function retrieves the data from Hive and performs initial joins)
final JavaPairRDD<String, Tuple2<modelService,modelONT>> LKP_IVAPP_DIM = CombineIVAPP();
LKP_IVAPP_DIM.cache();
System.out.println("Mapped Tables to K/V Pairs");
//Kafka Topic settings
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put(KAFKA_TOPIC,KAFKA_PARA);
//Begin to stream from Kafka Topic
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
jssc, ZOOKEEPER_URL, KAFKA_GROUPID, topicMap);
//Map messages from Kafka Stream to Tuple
JavaDStream<String> json = messages.map(
new Function<Tuple2<String, String>, String>() {
@Override
public String call(Tuple2<String, String> message) {
return message._2();
}
}
);
//Map kafka JSON string to K/V RDD
JavaPairDStream<String, modelAlarms> RDD_ALARMS = json.mapToPair(new KafkaToRDD());
//Remove the null values
JavaPairDStream<String, modelAlarms> RDD_ALARMS_FILTERED = RDD_ALARMS.filter(new Function<Tuple2<String, modelAlarms>, Boolean>() {
@Override
public Boolean call(Tuple2<String, modelAlarms> item) {
return item != null;
}
});
//Join Alarm data from Kafka topic with hive lkp table LKP_IVAPP_DIM
JavaPairDStream<String, Tuple2<modelAlarms,Tuple2<modelService,modelONT>>> RDD_ALARMS_JOINED = RDD_ALARMS_FILTERED.transformToPair(new Function<JavaPairRDD<String, modelAlarms>, JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService,modelONT>>>>() {
@Override
public JavaPairRDD<String, Tuple2<modelAlarms, Tuple2<modelService,modelONT>>> call(JavaPairRDD<String, modelAlarms> v1) throws Exception {
return v1.join(LKP_IVAPP_DIM);
}
});

Measure duration of executing combineByKey function in Spark

I want to measure the time that executing the combineByKey function takes. I always get a result of 20-22 ms (HashPartitioner) or ~350 ms (without partitioning) with the code below, independent of the file size I use (file0: ~300 kB, file1: ~3 GB, file2: ~8 GB). Can this be true, or am I doing something wrong?
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
boolean partitioning = true; //or false
int partitionCount = 100; // between 1 and 200 I can't see any difference in the duration!
SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionCount);
long duration = System.currentTimeMillis();
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration; // Measured time always the same, independent of file size (~20ms with / ~350ms without partitioning)
// Do an action
Tuple2<Integer, Float> test = consumptionRDD.takeSample(true, 1).get(0);
sc.stop();
Some helper methods (shouldn't matter):
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
The web UI provides details on the jobs and stages that your application has run. It shows the time taken by each of them, and you can now filter on various details such as Scheduler Delay, Task Deserialization Time, and Result Serialization Time.
The default port for the standalone master's web UI is 8080 (the UI of a running application defaults to port 4040). Completed applications are listed there, and you can then click on the name, or craft the URL like this: x.x.x.x:8080/history/app-[APPID], to access those details.
I don't believe any other "built-in" methods exist to monitor the running time of a task/stage. Otherwise, you may want to go deeper and use a JVM debugging framework.
EDIT: combineByKey is a transformation, which means that it is not executed when you call it on your RDD; it only runs once an action is invoked (read more on the lazy behaviour of RDDs here, chapter 3.1). I believe the time difference you're observing comes from the time Spark takes to create the actual data structure when partitioning or not.
If there is a difference, you'll see it when the action runs (takeSample here).
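The effect described in this answer can be reproduced without Spark, because Java streams are lazy in a similar way. The sketch below is an analogy only, not Spark code: declare() plays the role of combineByKey (it only describes the work), and the terminal collect() plays the role of the action. The class name, the declare helper, and the call counter are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyTimingDemo {

    // Counts how often the mapping function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // Declares a lazy pipeline; no element is processed in this method.
    static Stream<Integer> declare(List<Integer> input) {
        return input.stream().map(x -> {
            calls.incrementAndGet();
            return x * 2;
        });
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5);

        // Timing this line measures only the construction of the pipeline.
        long t0 = System.nanoTime();
        Stream<Integer> pipeline = declare(input);
        long declareNanos = System.nanoTime() - t0;
        System.out.println("mapper calls after declaring: " + calls.get());

        // The terminal operation is where the work happens, so time this instead.
        long t1 = System.nanoTime();
        List<Integer> result = pipeline.collect(Collectors.toList());
        long runNanos = System.nanoTime() - t1;
        System.out.println("mapper calls after collecting: " + calls.get());

        System.out.println("declare ns=" + declareNanos
                + ", run ns=" + runNanos + ", result=" + result);
    }
}
```

Applied to the question's code, this suggests moving the second System.currentTimeMillis() call to after takeSample. That measurement would then also include reading the file and any shuffle, not combineByKey alone.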

Spark does not distribute work

I set up two virtual machines to test Spark in a distributed setup. It seems that my jobs are only run locally on one node, the one I use to submit the job.
One node runs as datanode/worker node, and the second one is additionally the namenode/secondary namenode.
I configured the underlying Hadoop to use YARN.
The jps command confirms that the various services are started correctly and are basically available after I executed the start* scripts in hadoop/spark.
I use htop to "track" whether the other node is used, but its CPU usage only fluctuates between 2 and 3%, so it is probably not used. I wonder what I am missing here.
I start my job with this command:
./spark-submit --class com.... DistributedTest --master yarn-client myJar.jar
This is the class I am executing (the data.txt file is about 1GB pure text)
public class DistributedTest
{
public static void main(String[] args)
throws IOException
{
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile("hdfs://woodpecker:10001/husr/data.txt");// .persist(StorageLevel.DISK_ONLY());
long numAs = logData.filter(new Function<String, Boolean>()
{
public Boolean call(String s)
{
return s.contains("a");
}
}).count();
long numBs = logData.filter(new Function<String, Boolean>()
{
public Boolean call(String s)
{
return s.contains("b");
}
}).count();
sc.close();
String s = "Lines with a: " + numAs + ", lines with b: " + numBs;
System.out.println(s);
}
}
Does anyone have any idea why my setup does not distribute the work?
The filter operation is definitely distributed, and the count is partially computed on the workers, while the total count is calculated back on the master (the driver process). The result of the count also ends up on the master.
Filtering one GB of data isn't really going to stress Spark anyway, so you should only see a short CPU spike on the worker. Rather take a look at I/O usage.
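The split described in this answer, per-partition work followed by a final sum, can be sketched in plain Java. This is an illustration under assumptions, not Spark's implementation: lists stand in for partitions, the round-robin partition() helper does not reflect how Spark splits an HDFS file, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PartialCountDemo {

    // Splits the input into slices that stand in for RDD partitions.
    static List<List<String>> partition(List<String> lines, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < lines.size(); i++) {
            parts.get(i % numPartitions).add(lines.get(i));
        }
        return parts;
    }

    // "Worker" side: filter and count within a single partition.
    static long countMatching(List<String> partition, String needle) {
        return partition.stream().filter(s -> s.contains(needle)).count();
    }

    // "Driver" side: only the small partial counts are combined here.
    static long totalCount(List<List<String>> partitions, String needle) {
        return partitions.parallelStream()
                .mapToLong(p -> countMatching(p, needle))
                .sum();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("apple", "berry", "cat", "dog");
        List<List<String>> partitions = partition(lines, 2);
        System.out.println("Lines with a: " + totalCount(partitions, "a"));
    }
}
```

The point of the sketch is that the expensive part (scanning the lines) happens per partition, and only one number per partition travels back to be summed, which is consistent with seeing little CPU load for a 1 GB filter.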
Your app is fine, there must be something wrong with your setup.
First, go through your Spark UI and make sure you have multiple workers. It also depends on how many partitions your RDD has.

Resources