Spark Streaming - long garbage collection time - apache-spark

I'm running a Spark Streaming application with 1-hour batches to join two data feeds and write the output to disk. The total size of one data feed is about 40 GB per hour (split in multiple files), while the size of the second data feed is about 600-800 MB per hour (also split in multiple files). Due to application constraints, I may not be able to run smaller batches. Currently, it takes about 20 minutes to produce the output in a cluster with 140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, each with 5 cores and 22 GB of RAM.
I execute mapToPair(), filter(), and reduceByKeyAndWindow() (1-hour batches) on the 40 GB data feed. Most of the computation time is spent on these operations. What worries me is the garbage collection (GC) time per executor, which ranges from 25 seconds to 9.2 minutes. I attach two screenshots below: one lists the GC time and one prints out the GC log for a single executor. I suspect that the executor that spends 9.2 minutes on garbage collection is eventually killed by the Spark driver.
I think these numbers are too high. Do you have any suggestions for keeping GC time low? I'm already using the Kryo serializer, the CMS garbage collector (-XX:+UseConcMarkSweepGC), and spark.rdd.compress=true.
Is there anything else that would help?
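For context, this is roughly how those settings are passed in (a simplified sketch of my configuration, not the exact submit command; the application name is a placeholder):

import org.apache.spark.SparkConf;

// Sketch of the GC-related settings mentioned above; values are illustrative only.
SparkConf conf = new SparkConf()
    .setAppName("feed-join-streaming") // placeholder name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    // CMS garbage collector plus GC logging on the executors
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails");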
EDIT
This is a snippet of my code:
// The data feed is then mapped to a Key/Value RDD. Some string in the original RDD Stream will be filtered out according to the business logic
JavaPairDStream<String, String> filtered_data = orig_data.mapToPair(parserFunc)
    .filter(new Function<scala.Tuple2<String, String>, Boolean>() {
        @Override
        public Boolean call(scala.Tuple2<String, String> t) {
            return (!t._2().contains("null"));
        }
    });
// WINDOW_DURATION = 2 hours, SLIDE_DURATION = 1 hour. The data feed will be later joined with another feed.
// These two feeds are asynchronous: records in the second data feed may match records that appeared in the first data feed up to 2 hours before.
// I need to save RDDs of the first data feed because they may be joined later.
// I'm using reduceByKeyAndWindow() instead of window() because I can build this "cache" incrementally.
// For a given key, appendString() simply appends a new string to the value, while removeString() removes the strings (i.e. parts of the value) that go out of scope (i.e. out of WINDOW_DURATION).
JavaPairDStream<String, String> windowed_data = filtered_data.reduceByKeyAndWindow(appendString, removeString, Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION))
    .flatMapValues(new Function<String, Iterable<String>>() {
        @Override
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(","));
        }
    });
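For reference, appendString() and removeString() are plain Function2<String, String, String> reducers. A simplified sketch of the idea (my real functions carry additional business logic):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.function.Function2;

// Sketch only: appendString concatenates a new record onto the comma-separated value for a key.
Function2<String, String, String> appendString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String newValue) {
        return acc.isEmpty() ? newValue : acc + "," + newValue;
    }
};

// Sketch only: removeString drops the records that slide out of the window from the accumulated value.
Function2<String, String, String> removeString = new Function2<String, String, String>() {
    @Override
    public String call(String acc, String oldValue) {
        Set<String> expired = new HashSet<>(Arrays.asList(oldValue.split(",")));
        StringBuilder kept = new StringBuilder();
        for (String part : acc.split(",")) {
            if (!expired.contains(part)) {
                if (kept.length() > 0) kept.append(",");
                kept.append(part);
            }
        }
        return kept.toString();
    }
};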
// This is a second data feed, which is also transformed to a K/V RDD for the join operation with the first feed
JavaDStream<String> second_stream = jssc.textFileStream(MSP_DIR);
JavaPairDStream<String, String> ss_kv = second_stream.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public scala.Tuple2<String, String> call(String row) {
        String[] el = row.split("\\|");
        return new scala.Tuple2<>(el[9], row);
    }
});
JavaPairDStream<String, String> joined_stream = ss_kv.join(windowed_data);
// Use foreachRDD() to save joined_stream to HDFS
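The save step itself is roughly the following (a sketch, assuming a Spark version with the VoidFunction overload of foreachRDD; OUTPUT_DIR is a placeholder for the real HDFS path):

// Sketch of the final step: write each joined batch under a timestamped HDFS directory.
joined_stream.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
    @Override
    public void call(JavaPairRDD<String, String> rdd) {
        if (!rdd.isEmpty()) {
            rdd.saveAsTextFile(OUTPUT_DIR + "/" + System.currentTimeMillis()); // OUTPUT_DIR is a placeholder
        }
    }
});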

Related

Write >1 files (limited by size) from a spark partition

I am fetching an RDBMS table using JDBC with some 10-20 partitions using ROW_NUM. From each of these partitions I want to process/format the data and write one or more files out to file storage, based on file size. Each file must be less than 500 MB. How do I write multiple files out from a single partition? The Spark config property 'spark.sql.files.maxRecordsPerFile' won't work for me because each row can be a different size, as there is blob data in the row. The size of this blob may vary from a few hundred bytes to 50 MB, so I cannot really limit the write by maxRecordsPerFile.
How do I further split each DB partition, into smaller partitions and then write out the files?
If I do a repartition, it shuffles across all executors. I am trying to keep all data within the same executor to avoid a shuffle. Is it possible to repartition within the same executor core (repartition the current partition), and then write a single file from each?
I tried the following to at least calculate the total size of my payload and then repartition, but it fails with "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong, and how can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;
        }
    }
    return strLst.iterator();
});
...
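One workaround I am considering is to avoid mutating a driver-side local inside the lambda entirely: compute the total payload size as its own Spark action, derive the partition count on the driver, and only then repartition. A rough sketch, reusing blobToString() and xmlDataSet from the snippet above and the 500 MB limit (note that repartition() still shuffles, which is what I was hoping to avoid; this only fixes the compile error):

// Rough sketch: compute sizes as a Spark job, decide the partition count on the driver, then repartition.
JavaRDD<String> xmlStrings = xmlDataSet.toJavaRDD().map(row -> blobToString(row));
xmlStrings.cache();

long totalSizeBytes = xmlStrings
    .map(s -> (long) s.getBytes().length)
    .reduce((a, b) -> a + b);

// Aim for files below ~500 MB each.
long maxBytesPerFile = 500L * 1024 * 1024;
int numOfPartitions = (int) Math.max(1, (totalSizeBytes + maxBytesPerFile - 1) / maxBytesPerFile);

// One output file per partition; the path is a placeholder.
xmlStrings.repartition(numOfPartitions).saveAsTextFile("/tmp/output");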

Spark binaryRecords() giving less performance as compare to textFile()

I have Spark job code as below. It works fine on the cluster with the configuration listed further down.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = spark.read()
    .textFile(path)
    .javaRDD()
    .map(line -> {
        return new SomeClass(line);
    });
Dataset<Row> responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
    .format("text")
    .save(path + "processed");
However, if I read a binary file of the same size, it takes much more time.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = sparkContext
    .binaryRecords(path, 10000, new Configuration())
    .toJavaRDD()
    .map(record -> {
        return new SomeClass(record);
    });
Dataset<Row> responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
    .format("text")
    .save(path + "processed");
Below is my configuration.
driver-memory 8g
executor-memory 6g
num-executors 16
Time taken by the first code with a 150 MB file is 1.30 mins.
Time taken by the second code with a 150 MB file is 4 mins.
Also, the first code was able to run on all 16 executors, whereas the second uses only one.
Any suggestions why it is slow?
I found the issue. The textFile() method was creating 16 partitions (you can check the number of partitions using the getNumPartitions() method on the RDD), whereas binaryRecords() created only 1 (the Java binaryRecords API doesn't provide an overloaded method that specifies the number of partitions to be created).
I increased the number of partitions of the RDD created by binaryRecords() by using the repartition(NUM_OF_PARTITIONS) method on the RDD.
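A sketch of that change (NUM_OF_PARTITIONS is just a placeholder; I matched it to the number of executors):

// binaryRecords() yields a single partition, so spread the records out before the map.
int NUM_OF_PARTITIONS = 16; // placeholder: roughly one partition per executor
JavaRDD<SomeClass> jRDD = sparkContext
    .binaryRecords(path, 10000, new Configuration())
    .toJavaRDD()
    .repartition(NUM_OF_PARTITIONS)
    .map(record -> {
        return new SomeClass(record);
    });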

File is overwritten while using saveAsNewAPIHadoopFile

We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the Spark stream.
Records are published to Kafka every second. Our requirement is to store records published to Kafka in a single folder per minute. The stream reads records every five seconds. For instance, records published between 12:00 PM and 12:01 PM are stored in folder "1200", records between 12:01 PM and 12:02 PM in folder "1201", and so on.
The code I wrote is as follows
//First Group records in RDD by date
stream.foreachRDD(rddWithinStream -> {
    JavaPairRDD<String, Iterable<String>> rddGroupedByDirectory = rddWithinStream.mapToPair(t -> {
        return new Tuple2<String, String>(targetHadoopFolder, t._2());
    }).groupByKey();
    // All records grouped by the folders they will be stored in
    // Create an RDD for each target folder.
    for (String hadoopFolder : rddGroupedByDirectory.keys().collect()) {
        JavaPairRDD<String, Iterable<String>> rddByKey = rddGroupedByDirectory.filter(groupedTuples -> {
            return groupedTuples._1().equals(hadoopFolder);
        });
        // And store it in Hadoop
        rddByKey.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
    }
});
Since the stream processes data every five seconds, saveAsNewAPIHadoopFile gets invoked multiple times per minute. This causes the "part-00000" file to be overwritten every time.
I was expecting that, in the directory specified by the "directory" parameter, saveAsNewAPIHadoopFile would keep creating part-0000N files even when I have a single worker node.
Any help/alternatives are greatly appreciated.
Thanks.
In this case you have to build your output path and filename yourself. Incremental file naming works only when the output operation is called directly on the DStream (not per RDD).
The function argument of stream.foreachRDD can receive the Time of each micro-batch. Referring to the Spark documentation:
def foreachRDD(foreachFunc: (RDD[T], Time) ⇒ Unit)
So you can save each RDD as follows:
stream.foreachRDD((rdd, time) -> {
    String directory = timeToDirName(prefix, time);
    rdd.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
});
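Here timeToDirName() is not a Spark API, just a helper you write yourself; a hypothetical version producing the per-minute folder names from the question might look like this:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.spark.streaming.Time;

// Hypothetical helper: maps a batch Time to a folder name such as <prefix>/1200, <prefix>/1201, ...
private static String timeToDirName(String prefix, Time time) {
    SimpleDateFormat minuteFormat = new SimpleDateFormat("HHmm");
    return prefix + "/" + minuteFormat.format(new Date(time.milliseconds()));
}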
You can try this -
Split the process into 2 steps:
Step-1 :- Write Avro file using saveAsNewAPIHadoopFile to <temp-path>
Step-2 :- Move file from <temp-path> to <actual-target-path>
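A minimal sketch of Step-2, assuming both paths are on the same HDFS filesystem (tempPath and targetPath are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Move the freshly written output from the temporary location to its final place.
FileSystem fs = FileSystem.get(new Configuration());
Path tempPath = new Path("/data/tmp/batch-output");      // placeholder
Path targetPath = new Path("/data/final/batch-output");  // placeholder
if (!fs.rename(tempPath, targetPath)) {
    throw new IOException("Failed to move " + tempPath + " to " + targetPath);
}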
Hope this is helpful.

Spark job in cluster is very slow ("executor computing time" increases for each task)

Use case: We have a collection of small documents to be analysed.
Spark setup: Spark 1.5 with a standalone cluster, 2 worker nodes with 8 cores and 10 GB of memory each.
Program:
1) Documents are loaded into a List<Tuple2<Doc_Name, Doc_Contents>>.
2) parallelizePairs() is invoked with the above collection and the number of partitions is set to 100.
3) map() - each document is analysed and a JSON is generated.
4) collect()
Issue:
For tasks that are scheduled first (tasks 0 to 16), "executor computing time" is around 3 mins, but for tasks that are scheduled later, "executor computing time" increases each time (8 mins, 12 mins, and 17 mins). Here is the screenshot.
Observation:
When I run the code in local mode, each task takes the same amount of time and everything works fine. When the same code is executed in standalone cluster mode, each task becomes slower and slower, as shown in the screenshot. I tried both client and cluster deploy modes; in both cases, when the application is executed on the standalone cluster, "executor computing time" increases for subsequent tasks.
Sample Code:
List<Tuple2<String, String>> data = getData(); // data is around 7 MB
JavaPairRDD<String, String> pairRDD = sc.parallelizePairs(data, 100);
JavaRDD<Tuple2<String, String>> aRDD = pairRDD.map(new ProcessDoc());
List<Tuple2<String, String>> output = aRDD.collect();
public class ProcessDoc implements Function<Tuple2<String, String>, Tuple2<String, String>>, Serializable {
    @Override
    public Tuple2<String, String> call(Tuple2<String, String> fileNameContentTuple) throws Exception {
        String fileName = FilenameUtils.getBaseName(fileNameContentTuple._1());
        String content = fileNameContentTuple._2();
        // The API below has dependencies on many other classes and takes approximately 2 seconds per record.
        DocumentAnalyzerResponse documentAnalyzerResponse = DocumentAnalyzerFactory.parse(content);
        String resultJson = objectMapper.writeValueAsString(documentAnalyzerResponse);
        // Upload the resultJson to s3
        return new Tuple2<>(fileName, resultJson); // returning the pair so the method compiles
    }
}
Thanks for all comments.

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

I have a Spark job which does some processing on ORC data and stores it back as ORC using the DataFrameWriter save() API introduced in Spark 1.4.0. I have the following piece of code, which uses heavy shuffle memory. How do I optimize it? Is there anything wrong with it? It works as expected, but it is slow because of GC pauses, and it shuffles lots of data, so it hits memory issues. I am new to Spark.
JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().coalesce(1, false).map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        List<Object> rowAsList;
        Row row1 = null;
        if (row != null) {
            rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
            row1 = RowFactory.create(rowAsList.toArray());
        }
        return row1;
    }
}).union(modifiedRDD);
DataFrame updatedDataFrame = hiveContext.createDataFrame(updatedDsqlRDD,renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity", "date").save("baseTable");
Edit
As per the suggestion, I tried to convert the above code to use mapPartitionsWithIndex(), as shown below. I still see data shuffling; it is better than the above code, but it still fails by hitting the GC limit and throwing OOM, or goes into long GC pauses and times out, and then YARN kills the executor.
I am using spark.storage.memoryFraction = 0.5 and spark.shuffle.memoryFraction = 0.4; I tried the defaults and many other combinations, but nothing helped.
JavaRDD<Row> indexedRdd = sourceRdd.cache().mapPartitionsWithIndex(new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
    @Override
    public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
        List<Row> rowList = new ArrayList<>();
        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
            Row updatedRow = RowFactory.create(rowAsList.toArray());
            rowList.add(updatedRow);
        }
        return rowList.iterator();
    }
}, true).coalesce(200, true);
Coalescing an RDD or DataFrame to a single partition means that all your processing happens on a single machine. This is not a good thing for a variety of reasons: all of the data has to be shuffled across the network, there is no more parallelism, and so on. Instead you should look at other operators like reduceByKey, mapPartitions, or really pretty much anything besides coalescing the data to a single machine.
Note: looking at your code, I don't see why you are bringing it down to a single machine; you can probably just remove that part.
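For illustration, here is the same transformation with the coalesce(1, false) call simply removed, keeping the frame's original parallelism (this reuses the iterate() helper and the variables from the question):

JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().map(new Function<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        if (row == null) {
            return null;
        }
        // iterate() is the question's own per-row rewriting helper
        List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
        return RowFactory.create(rowAsList.toArray());
    }
}).union(modifiedRDD);
DataFrame updatedDataFrame = hiveContext.createDataFrame(updatedDsqlRDD, renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity", "date").save("baseTable");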
