Spark streaming creates one task per input file - apache-spark

I am processing a sequence of input files with Spark Streaming.
Spark Streaming creates one task per input file, and a corresponding number of partitions and output part files.
JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
            @Override
            public Boolean call(Path v1) throws Exception {
                return Boolean.TRUE;
            }
        }, false);
For example, if there are 100 input files in an interval, there will be 100 part files in the output.
What does each part file represent?
(output from a task)
How do I reduce the number of output files (to 2 or 4 ...)?
Does this depend on the number of partitions?

Each file represents an RDD partition. If you want to reduce the number of partitions you can call repartition or coalesce with the number of partitions you wish to have.
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
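For the streaming case above, a minimal sketch (it reuses the myRDD stream from the question; the target of 4 partitions is illustrative): coalesce shrinks each micro-batch without a full shuffle, while repartition(4) would rebalance the data at the cost of one.
// Sketch only: reduce every micro-batch to 4 partitions before writing, so each
// interval produces at most 4 part files.
JavaPairDStream<Text, CustomDataType> reduced =
        myRDD.transformToPair(rdd -> rdd.coalesce(4));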

Related

Write >1 files (limited by size) from a spark partition

I am fetching an RDBMS table over JDBC with some 10-20 partitions based on ROW_NUM. From each of these partitions I want to process/format the data and write one or more files out to file storage, with each file under 500 MB. How do I write multiple files out from a single partition? The config property 'spark.sql.files.maxRecordsPerFile' won't work for me because each row can be a different size: the rows contain blob data that may vary from a few hundred bytes to 50 MB, so I cannot limit the write by maxRecordsPerFile.
How do I further split each DB partition into smaller partitions and then write out the files?
If I do a repartition, it shuffles across all executors. I am trying to keep all data within the same executor to avoid a shuffle. Is it possible to repartition within the same executor core (repartition the current partition) and then write a single file from each?
I tried the following to at least calculate the total size of my payload and then repartition, but it fails with "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong, and how can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++; // the line the compiler rejects
        }
    }
    return strLst.iterator();
});
...
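One way to sidestep the captured-variable error is to compute the payload size with an action first, derive the partition count on the driver, and only then repartition. A minimal sketch under those assumptions (it reuses the blobToString helper from the question, assumes a 500 MB target per file, and accepts the shuffle that repartition causes):
// Sketch only: size the data on the cluster, decide the partition count on the
// driver, then repartition. Nothing mutates a captured local inside a lambda.
JavaRDD<String> xmlStrings = xmlDataSet.toJavaRDD()
        .map(row -> blobToString(row))      // blobToString as in the question
        .cache();                           // reused twice below

long totalBytes = xmlStrings
        .map(s -> (long) s.getBytes().length)
        .reduce(Long::sum);                 // action: the total comes back to the driver

long targetFileBytes = 500L * 1024 * 1024;  // assumed 500 MB per output file
int numOfPartitions = (int) (totalBytes / targetFileBytes) + 1;

xmlStrings.repartition(numOfPartitions)
        .saveAsTextFile("/output/path");    // hypothetical output path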

Converting Dataframe to RDD reduces partitions

In our code, the DataFrame was created as:
DataFrame DF = hiveContext.sql("select * from table_instance");
When I convert my DataFrame to an RDD and get its number of partitions:
RDD<Row> newRDD = DF.rdd();
System.out.println(newRDD.getNumPartitions());
it reduces the number of partitions to 1 (1 is printed in the console). Originally my DataFrame had 102 partitions.
UPDATE:
While reading, I repartitioned the DataFrame:
DataFrame DF = hiveContext.sql("select * from table_instance").repartition(200);
and then converted it to an RDD, which then gave me 200 partitions.
Does JavaSparkContext have a role to play in this? When we convert a DataFrame to an RDD, is the default minimum-partitions flag also considered at the Spark context level?
UPDATE:
I made a separate sample program in which I read the exact same table into a DataFrame and converted it to an RDD. No extra stage was created for the RDD conversion, and the partition count was also correct. I am now wondering what I am doing differently in my main program.
Please let me know if my understanding is wrong here.
It basically depends on the implementation of hiveContext.sql(). Since I am new to Hive, my guess is that hiveContext.sql doesn't know how, or is not able, to split the data present in the table.
For example, when you read a text file from HDFS, the Spark context considers the number of blocks used by that file to determine the partitions.
What you did with repartition is the obvious solution for these kinds of problems. (Note: repartition may cause a shuffle operation if a proper partitioner is not used; a hash partitioner is used by default.)
Coming to your doubt, hiveContext may consider the default minimum-partitions property. But relying on the default property is not going to solve all your problems. For instance, if your Hive table's size increases, your program still uses the default number of partitions.
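As an illustration of the block-based behaviour, a minimal sketch (the context jsc, the path, and the hint of 16 are all hypothetical):
// The minPartitions hint (16 here) combined with the file's HDFS block layout
// determines how many partitions the resulting RDD gets.
JavaRDD<String> lines = jsc.textFile("hdfs:///data/input.txt", 16);
System.out.println(lines.getNumPartitions());  // at least 16, depending on the block layout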
Update: Avoid shuffle during repartition
Define your custom partitioner:
public class MyPartitioner extends HashPartitioner {
    private final int partitions;

    public MyPartitioner(int partitions) {
        super(partitions);
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return this.partitions;
    }

    @Override
    public int getPartition(Object key) {
        if (key instanceof String) {
            return super.getPartition(key);
        } else if (key instanceof Integer) {
            return (Integer.valueOf(key.toString()) % this.partitions);
        } else if (key instanceof Long) {
            return (int) (Long.valueOf(key.toString()) % this.partitions);
        }
        // TODO ... add more types
        return super.getPartition(key);
    }
}
Use your custom partitioner:
JavaPairRDD<Long, SparkDatoinDoc> pairRdd = hiveContext.sql("select * from table_instance")
        .toJavaRDD()
        .mapToPair( /* TODO ... expose the column as key */ );
pairRdd = pairRdd.partitionBy(new MyPartitioner(200));
// ... rest of processing

Spark Streaming - long garbage collection time

I'm running a Spark Streaming application with 1-hour batches to join two data feeds and write the output to disk. The total size of one data feed is about 40 GB per hour (split in multiple files), while the size of the second data feed is about 600-800 MB per hour (also split in multiple files). Due to application constraints, I may not be able to run smaller batches. Currently, it takes about 20 minutes to produce the output in a cluster with 140 cores and 700 GB of RAM. I'm running 7 workers and 28 executors, each with 5 cores and 22 GB of RAM.
I execute mapToPair(), filter(), and reduceByKeyAndWindow() (over the 1-hour batches) on the 40 GB data feed. Most of the computation time is spent on these operations. What worries me is the garbage collection (GC) time per executor, which ranges from 25 seconds to 9.2 minutes. I attach two screenshots below: one lists the GC time and one prints out GC comments for a single executor. I anticipate that the executor that spends 9.2 minutes on garbage collection will eventually be killed by the Spark driver.
I think these numbers are too high. Do you have any suggestions for keeping GC time low? I'm already using the Kryo serializer, -XX:+UseConcMarkSweepGC, and spark.rdd.compress=true.
Is there anything else that would help?
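For reference, the settings above are typically wired in along these lines (a sketch only; the app name and the GC-logging flags are illustrative additions):
// Illustrative configuration: Kryo serialization, RDD compression, and the CMS
// collector (plus GC logging) on the executors.
SparkConf conf = new SparkConf()
        .setAppName("streaming-join")  // hypothetical app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.rdd.compress", "true")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(60));  // 1-hour batches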
EDIT
This is a snippet of my code:
// The data feed is then mapped to a key/value DStream. Some strings in the original stream are filtered out according to the business logic.
JavaPairDStream<String, String> filtered_data = orig_data.mapToPair(parserFunc)
        .filter(new Function<scala.Tuple2<String, String>, Boolean>() {
            @Override
            public Boolean call(scala.Tuple2<String, String> t) {
                return (!t._2().contains("null"));
            }
        });

// WINDOW_DURATION = 2 hours, SLIDE_DURATION = 1 hour. The data feed will later be joined with another feed.
// These two feeds are asynchronous: records in the second data feed may match records that appeared in the first data feed up to 2 hours before.
// I need to save RDDs of the first data feed because they may be joined later.
// I'm using reduceByKeyAndWindow() instead of window() because I can build this "cache" incrementally.
// For a given key, appendString() simply appends a new string to the value, while removeString() removes the strings (i.e. parts of the values) that go out of scope (i.e. out of WINDOW_INTERVAL).
JavaPairDStream<String, String> windowed_data = filtered_data.reduceByKeyAndWindow(appendString, removeString, Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION))
        .flatMapValues(new Function<String, Iterable<String>>() {
            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(","));
            }
        });

// This is the second data feed, which is also transformed to a K/V DStream for the join operation with the first feed.
JavaDStream<String> second_stream = jssc.textFileStream(MSP_DIR);
JavaPairDStream<String, String> ss_kv = second_stream.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public scala.Tuple2<String, String> call(String row) {
            String[] el = row.split("\\|");
            return new scala.Tuple2<String, String>(el[9], row);
        }
    });

JavaPairDStream<String, String> joined_stream = ss_kv.join(windowed_data);
// Use foreachRDD() to save joined_stream to HDFS

Spark streaming with saveAsTextFile giving mismatched record count

I am reading data from a Kinesis stream and then just storing it into HDFS using the saveAsTextFile command. I get the output records, but the count is mismatched: the total input record count is 1000 and I expect the same in the output, but it always writes fewer than 100.
splitCSV.foreachRDD(new VoidFunction2<JavaRDD<String[]>, Time>() {
    public void call(JavaRDD<String[]> rdd, Time time) throws Exception {
        ...
    }
    ...saveAsTextFile("/Selva")
});

File is overwritten while using saveAsNewAPIHadoopFile

We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the Spark stream.
Records are published on Kafka every second. Our requirement is to store the records published on Kafka in a single folder per minute. The stream will read records every five seconds. For instance, records published between 12:00 PM and 12:01 PM are stored in folder "1200", those between 12:01 PM and 12:02 PM in folder "1201", and so on.
The code I wrote is as follows:
// First group records in the RDD by date
stream.foreachRDD(rddWithinStream -> {
    JavaPairRDD<String, Iterable<String>> rddGroupedByDirectory = rddWithinStream.mapToPair(t -> {
        return new Tuple2<String, String>(targetHadoopFolder, t._2());
    }).groupByKey();
    // All records grouped by the folders they will be stored in

    // Create an RDD for each target folder.
    for (String hadoopFolder : rddGroupedByDirectory.keys().collect()) {
        JavaPairRDD<String, Iterable<String>> rddByKey = rddGroupedByDirectory.filter(groupedTuples -> {
            return groupedTuples._1().equals(hadoopFolder);
        });
        // And store it in Hadoop
        rddByKey.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
    }
});
Since the stream processes data every five seconds, saveAsNewAPIHadoopFile gets invoked multiple times per minute. This causes the "part-00000" file to be overwritten every time.
I was expecting that, in the directory specified by the "directory" parameter, saveAsNewAPIHadoopFile would keep creating part-0000N files even when I have a single worker node.
Any help/alternatives are greatly appreciated.
Thanks.
In this case you have to build your output path and filename yourself. Incremental file naming works only when the output operation is called directly on the DStream (not per RDD).
The function argument of stream.foreachRDD can receive the Time information for each micro-batch. Referring to the Spark documentation:
def foreachRDD(foreachFunc: (RDD[T], Time) ⇒ Unit)
So you can save each RDD as follows:
stream.foreachRDD((rdd, time) -> {
    String directory = timeToDirName(prefix, time);
    rdd.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
});
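The timeToDirName helper is not defined in the answer; one possible sketch (assuming per-minute folder names such as "1200", as in the question):
// Hypothetical helper: derive a per-minute folder name like "<prefix>/1200"
// from the batch Time (milliseconds since the epoch).
private static String timeToDirName(String prefix, Time time) {
    java.text.SimpleDateFormat fmt = new java.text.SimpleDateFormat("HHmm");
    return prefix + "/" + fmt.format(new java.util.Date(time.milliseconds()));
}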
You can try this: split the process into 2 steps:
Step 1: Write the Avro file using saveAsNewAPIHadoopFile to <temp-path>.
Step 2: Move the file from <temp-path> to <actual-target-path>.
Hope this is helpful.
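A minimal sketch of step 2 (the paths stay placeholders, as in the answer): move the part files from the temp location into the target folder with the Hadoop FileSystem API.
// Sketch only: relies on org.apache.hadoop.fs.{FileSystem, Path, FileStatus}
// and org.apache.hadoop.conf.Configuration.
private static void movePartFiles(Configuration hadoopConf, String tempPath, String targetPath) throws IOException {
    FileSystem fs = FileSystem.get(hadoopConf);
    fs.mkdirs(new Path(targetPath));
    for (FileStatus part : fs.listStatus(new Path(tempPath))) {
        if (part.getPath().getName().startsWith("part-")) {
            fs.rename(part.getPath(), new Path(targetPath, part.getPath().getName()));
        }
    }
}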
