File is overwritten while using saveAsNewAPIHadoopFile - apache-spark

We are using Spark 1.4 for Spark Streaming, with Kafka as the data source for the stream.
Records are published to Kafka every second. Our requirement is to store the records published to Kafka in a single folder per minute, while the stream reads records every five seconds. For instance, records published between 12:00 PM and 12:01 PM are stored in folder "1200", records published between 12:01 PM and 12:02 PM in folder "1201", and so on.
The code I wrote is as follows
// First, group the records in the RDD by the folder (minute) they belong to
stream.foreachRDD(rddWithinStream -> {
    JavaPairRDD<String, Iterable<String>> rddGroupedByDirectory = rddWithinStream.mapToPair(t -> {
        // targetHadoopFolder is the per-minute folder name derived from the record
        return new Tuple2<String, String>(targetHadoopFolder, t._2());
    }).groupByKey();
    // All records are now grouped by the folder they will be stored in.
    // Create an RDD for each target folder.
    for (String hadoopFolder : rddGroupedByDirectory.keys().collect()) {
        JavaPairRDD<String, Iterable<String>> rddByKey = rddGroupedByDirectory.filter(groupedTuples -> {
            return groupedTuples._1().equals(hadoopFolder);
        });
        // And store it in Hadoop
        rddByKey.saveAsNewAPIHadoopFile(hadoopFolder, String.class, String.class, TextOutputFormat.class);
    }
});
Since the stream processes data every five seconds, saveAsNewAPIHadoopFile gets invoked multiple times per minute, which causes the part-00000 file to be overwritten each time.
I was expecting that, in the directory passed to saveAsNewAPIHadoopFile, it would keep creating new part-0000N files even though I have a single worker node.
Any help/alternatives are greatly appreciated.
Thanks.

In this case you have to build the output path and filename yourself. Incremental file naming works only when the output operation is called directly on the DStream (not per RDD).
The function passed to stream.foreachRDD can receive the Time of each micro-batch. Referring to the Spark documentation:
def foreachRDD(foreachFunc: (RDD[T], Time) ⇒ Unit)
So you can save each RDD as follows:
stream.foreachRDD((rdd, time) -> {
    String directory = timeToDirName(prefix, time);
    rdd.saveAsNewAPIHadoopFile(directory, String.class, String.class, TextOutputFormat.class);
});
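Note that timeToDirName and prefix above are not Spark APIs; you supply them yourself. A minimal Scala sketch of such a helper, assuming the per-minute "HHmm" folder names from the question:
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Time

def timeToDirName(prefix: String, time: Time): String = {
  // Time.milliseconds is the batch time in epoch milliseconds
  val formatter = new SimpleDateFormat("HHmm")   // yields "1200", "1201", ... as in the question
  s"$prefix/${formatter.format(new Date(time.milliseconds))}"
}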

You can try this: split the process into two steps.
Step 1: Write the Avro file using saveAsNewAPIHadoopFile to <temp-path>.
Step 2: Move the file from <temp-path> to <actual-target-path>.
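A minimal Scala sketch of the move step, assuming the paths live on a Hadoop-compatible filesystem and that sc is the active SparkContext (both path values are placeholders):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val tempPath = new Path("/data/tmp/stream-output")       // placeholder for <temp-path>
val targetPath = new Path("/data/final/stream-output")   // placeholder for <actual-target-path>
// On HDFS, rename is atomic, so readers never observe a half-written directory
if (fs.exists(tempPath)) {
  fs.rename(tempPath, targetPath)
}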
Hope this is helpful.

Related

What is the best way to join an Apache Spark DStream with static data records from an HDFS sequence file?

My goal is to monitor user input using Spark Streaming. The user's input is a DStream and is just a key to a data record (a short string). The program needs to filter and read a static data set (a very big RDD, bigRDD) from an HDFS sequence file (a single record is 30 MB; the entire data set is about 10,000 records) by the user-entered key value. Then the program performs calculations on bigRDD and returns the result records (30 MB each) to the user. I hope that the calculation on bigRDD can be distributed as locally as possible, avoiding data transmission over the network, and that persist can be used to reduce hard-disk IO time. How should the specific steps be designed?
I tried:
JavaStreamingContext jsc = new JavaStreamingContext(...) ;
JavaDStream<String> lines = jsc.socketTextStream(...) ;
seqRDD = jsc.sparkContext().sequenceFile(...);// RDD from sequence file can not cache.
bigRDD = pairRdd.mapToPair(...) ;// bigRDD is used for cache.
bigRDD.cache() ;
inputDStream = lines.mapToPair(...) ; // convert DStream<string> to PairDStream<string,string> for join.
inputDStream.foreachRDD (inputRdd-> {
bigRDD2 = inputRdd.join(bigRDD);
resultRDD = bigRDD2.map( ... do calculation ... );
send_result_to_user(resultRDD) ;
})
But I don't know whether those steps are appropriate.
I will try broadcasting the data collected from the DStream's RDD in every batch and using RDD mapPartitions to process the data.
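A minimal Scala sketch of that broadcast-and-mapPartitions idea (it treats lines and bigRDD from the pseudocode above as their Scala counterparts, with bigRDD keyed by the same string key; sendResultToUser mirrors send_result_to_user and is hypothetical):
lines.foreachRDD { inputRdd =>
  // collect the small set of user-entered keys for this batch on the driver and broadcast it
  val requestedKeys = inputRdd.collect().toSet
  val keysBc = inputRdd.sparkContext.broadcast(requestedKeys)
  // scan the cached bigRDD partition-locally; only matching (large) records are kept,
  // so the 30 MB records are never shuffled for a join
  val resultRDD = bigRDD.mapPartitions { iter =>
    iter.filter { case (key, _) => keysBc.value.contains(key) }
  }
  sendResultToUser(resultRDD)   // hypothetical, mirrors send_result_to_user above
}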

Saving an RDD as a text file to an S3 URL is taking 5x more time than to HDFS

I have the following piece of code to save a file on S3
rdd
//drop header
.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
//assign the Key for PartitionBy
// if the key column doesn't exist (colIndex == -1), all the data goes to the part-00000 file
.map(line => if (colIndex == -1) (null, line) else (line.split(TILDE)(colIndex), line))
.partitionBy(customPartitioner)
.map { case (_, line) => line }
//Add Empty columns and Change the order and get the modified string
.map(line => addEmptyColumns(line, schemaIndexArray))
.saveAsTextFile(s"s3a://$bucketName/$serviceName/$folderPath")
With HDFS (no S3 path), the same code takes one fifth of the time. Are there any other approaches to fix this? I am setting the Hadoop configuration in Spark.
S3 is an object store, not a file system, hence the issues arising from eventual consistency and non-atomic rename operations: every time the executors write the result of a job, each of them writes to a temporary directory outside the main directory (on S3) where the files were supposed to be written, and once all the executors are done, a rename is performed to get atomic exclusivity. This is fine on a standard filesystem like HDFS, where renames are instantaneous, but on an object store like S3 it is not, since renames on S3 run at roughly 6 MB/s.
To overcome the above problem, set the following two configuration parameters:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
spark.speculation=false
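For example, a minimal sketch of setting these when building the session (the app name is arbitrary; the same values can also be passed with --conf on spark-submit):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-writer")   // arbitrary name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .getOrCreate()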
For more details on the above parameters, refer to the following answer:
How to tune spark job on EMR to write huge data quickly on S3
Reference article:
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98

How to handle the small file problem in Spark Structured Streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using Structured Streaming. Once the data is received and processed, I need to save it into the respective Parquet files in the HDFS store.
I am able to store and read the Parquet files, and I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files.
These Parquet files need to be read later by Hive queries.
So:
1) Does this strategy work in a production environment, or does it lead to a small file problem later?
2) What are the best practices to handle/design this kind of scenario, i.e., the industry standard?
3) How are these kinds of things generally handled in production?
Thank you.
I know this question is old. I had a similar problem, and I used Spark Structured Streaming query listeners to solve it.
My use case is fetching data from Kafka and storing it in HDFS with year, month, day, and hour partitions.
The code below takes the previous hour's partition data, applies repartitioning, and overwrites the data in the existing partition.
val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
session.streams.addListener(AppListener(config, session))

class AppListener(config: Config, spark: SparkSession) extends StreamingQueryListener {

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}

  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    this.synchronized { AppListener.mergeFiles(event.progress.timestamp, spark, config) }
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}

object AppListener {

  def mergeFiles(currentTs: String, spark: SparkSession, config: Config): Unit = {
    val configs = config.kafka(config.key.get)
    if (currentTs.datetime.isAfter(Processed.ts.plusMinutes(5))) {
      println(
        s"""
           |Current Timestamp : ${currentTs}
           |Merge Files : ${Processed.ts.minusHours(1)}
           |
           |""".stripMargin)
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val ts = Processed.ts.minusHours(1)
      val hdfsPath = s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
      val path = new Path(hdfsPath)
      if (fs.exists(path)) {
        val hdfsFiles = fs.listLocatedStatus(path)
          .filter(lfs => lfs.isFile && !lfs.getPath.getName.contains("_SUCCESS"))
          .map(_.getPath).toList
        println(
          s"""
             |Total files in HDFS location : ${hdfsFiles.length}
             | ${hdfsFiles.length > 1}
             |""".stripMargin)
        if (hdfsFiles.length > 1) {
          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path : ${hdfsPath}
               |Total Available files : ${hdfsFiles.length}
               |Status : Running
               |
               |""".stripMargin)
          val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
          df.repartition(1)
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(s"/tmp${hdfsPath}")
          df.cache().unpersist()
          spark
            .read
            .format(configs.writeFormat)
            .load(s"/tmp${hdfsPath}")
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(hdfsPath)
          Processed.ts = Processed.ts.plusHours(1).toDateTime("yyyy-MM-dd'T'HH:00:00")
          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path : ${hdfsPath}
               |Total files : ${hdfsFiles.length}
               |Status : Completed
               |
               |""".stripMargin)
        }
      }
    }
  }

  def apply(config: Config, spark: SparkSession): AppListener = new AppListener(config, spark)
}

object Processed {
  var ts: DateTime = DateTime.now(DateTimeZone.forID("UTC")).toDateTime("yyyy-MM-dd'T'HH:00:00")
}
Sometimes the data is huge, and I have divided it into multiple files using the logic below. The file size will be around ~160 MB.
val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSize = bytes.toLong
val numPartitions = (bytes.toLong./(1024.0)./(1024.0)./(10240)).ceil.toInt
df.repartition(if(numPartitions == 0) 1 else numPartitions)
.[...]
Edit-1
Using spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes we can get the size of the actual DataFrame once it is loaded into memory; for example, you can check the code below.
scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]
scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils
scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709
scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB
scala> import sys.process._
import sys.process._
scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r----- 3 svcmxns hdfs 0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r----- 3 svcmxns hdfs 727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.
As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
This is a common burning question in Spark Streaming with no single fixed answer.
I took an unconventional approach based on the idea of appending.
Since you are using Spark 2.4.1, this solution will be helpful.
If append were supported in a columnar file format like Parquet or ORC, it would be much easier, as the new data could be appended to the same file and the file would keep growing after every micro-batch.
However, since it is not supported, I took a versioning approach to achieve this. After every micro-batch, the data is written under a version partition.
e.g.
/prod/mobility/cdr_data/date=01-01-2010/version=12345/file1.parquet
/prod/mobility/cdr_data/date=01-01-2010/version=23456/file1.parquet
In every micro-batch, we read the old version's data, union it with the new streaming data, and write it again at the same path with a new version; then we delete the old versions. This way, after every micro-batch, there is a single version and a single file in every partition, and the size of the file in each partition keeps growing.
Since a union of a streaming dataset and a static dataset isn't allowed, we can use the foreachBatch sink (available in Spark >= 2.4.0) to convert the streaming dataset to a static dataset.
I have described how to achieve this optimally in the link. You might want to have a look.
https://medium.com/@kumar.rahul.nitk/solving-small-file-problem-in-spark-structured-streaming-a-versioning-approach-73a0153a0a
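To make the mechanics concrete, here is a rough Scala sketch of that read-old-version/union/write-new-version cycle inside foreachBatch. streamingDF, the path, and the helpers latestVersionIn and deleteVersion are illustrative assumptions, not taken from the linked article:
import org.apache.spark.sql.DataFrame

val mergeVersions: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  val basePath = "/prod/mobility/cdr_data/date=01-01-2010"   // one date partition, for illustration
  val oldVersion = latestVersionIn(basePath)                  // hypothetical helper
  // union the new micro-batch with the previous version's data
  val merged = batchDF.union(batchDF.sparkSession.read.parquet(s"$basePath/version=$oldVersion"))
  // rewrite everything as a single file under a new version, then drop the old one
  merged.coalesce(1).write.parquet(s"$basePath/version=$batchId")
  deleteVersion(basePath, oldVersion)                         // hypothetical helper
}
streamingDF.writeStream.foreachBatch(mergeVersions).start()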
You can set a trigger.
df.writeStream
.format("parquet")
.option("checkpointLocation", "path/to/checkpoint/dir")
.option("path", "path/to/destination/dir")
.trigger(Trigger.ProcessingTime("30 seconds"))
.start()
The larger the trigger interval, the larger the files.
Optionally, you could run the job with a scheduler (e.g. Airflow) and the trigger Trigger.Once(), or better Trigger.AvailableNow(). That runs the job only once per period and processes all available data, producing appropriately sized files.
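With the writer above, the scheduled variant only changes the trigger line, for example (a sketch; Trigger.AvailableNow() requires Spark 3.3+, while Trigger.Once() is available on 2.4):
import org.apache.spark.sql.streaming.Trigger

df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .trigger(Trigger.Once())   // process everything that is available, then stop
  .start()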

How to save data for later processing after stopping a direct stream in Spark Streaming?

I am creating the Kafka direct stream as below.
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
Then I am saving the values as:
val lines = messages.map(_.value)
Then I stop the streaming context when there are no further offsets to consume, as follows:
lines.foreachRDD(rdd => {
if(rdd.isEmpty()) {
messages.stop()
ssc.stop(false)
} else {
}
})
Then I am printing the lines as follows:
lines.print()
Then I start the stream as:
ssc.start()
It works fine: it reads the RDDs, prints the top 10, stops the messages stream, and stops the streaming context. But when I then execute the same line, lines.print(), it throws an exception saying that no new inputs, transformations, or outputs can be added after stopping the StreamingContext.
How do I achieve my goal? I am running it in spark-shell, not as a binary (a mandatory requirement).
Here is what I actually want to achieve:
1) Consume all JSON records from the Kafka topic.
2) Stop getting further records (it is guaranteed that no new records will be added to the Kafka topic after consumption, so I don't want to keep processing empty batches).
3) Do some preprocessing by extracting some fields from the JSON.
4) Do further operations on the preprocessed data.
5) Done.
When you call lines.print() again, it tries to apply the transformation messages.map(_.value) again. Since you stopped the context, it fails.
Save the lines variable by performing an action before stopping the context.
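A minimal Scala sketch of that idea, collecting each batch into a driver-side buffer before the context is stopped (this assumes the total volume fits in driver memory; `saved` is a name introduced here for illustration):
import scala.collection.mutable.ArrayBuffer

val saved = ArrayBuffer[String]()
lines.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    messages.stop()
    ssc.stop(false)
  } else {
    // foreachRDD runs this function on the driver, so the batch can be materialized here
    saved ++= rdd.collect()
  }
})
ssc.start()
// once the context has stopped, `saved` holds all consumed records and can be
// preprocessed (e.g. parsing the JSON fields) without the streaming context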

Spark streaming creates one task per input file

I am processing a sequence of input files with Spark Streaming.
Spark Streaming creates one task per input file, and a corresponding number of partitions and output part files.
JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);
For example, if there are 100 input files in an interval, then there will be 100 part files in the output directory.
What does each part file represent? (Is it the output of one task?)
How can I reduce the number of output files (to 2 or 4, ...)?
Does this depend on the number of partitions?
Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
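For example, a minimal Scala sketch that merges each batch down to two output files before saving (myDStream stands in for the fileStream above, and outputPath and the target of 2 partitions are placeholders):
myDStream.foreachRDD { rdd =>
  // coalesce avoids a full shuffle when only reducing the partition count
  rdd.coalesce(2).saveAsTextFile(outputPath)
}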
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
