How to load tar.gz files in streaming datasets? - apache-spark

I would like to stream from tar-gzip files (.tgz) that contain my actual CSV data.
I already managed to do structured streaming with Spark 2.2 when my data comes in as plain CSV files, but in reality the data arrives as gzipped CSV files.
Is there a way to have the structured streaming trigger decompress the files before handling the CSV stream?
The code I use to process the files is this:
val schema = Encoders.product[RawData].schema
val trackerData = spark
  .readStream
  .option("delimiter", "\t")
  .schema(schema)
  .csv(path)
val exceptions = rawCientData
  .as[String]
  .flatMap(extractExceptions)
  .as[ExceptionData]
This produces the expected output when path points to CSV files.
But I would like to use tar gzip files.
When I place those files at the given path, I do not get any exceptions, and the batch output tells me
"sources" : [ {
  "description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
  "startOffset" : null,
  "endOffset" : {
    "logOffset" : 0
  },
  "numInputRows" : 1095,
  "processedRowsPerSecond" : 211.0233185584891
} ],
But I do not get any actual data processed.
Console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+

I solved the part of reading .tar.gz (.tgz) files this way.
Inspired by this site, I created my own TGZ codec:
import java.io.{InputStream, OutputStream}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.hadoop.io.compress._
final class DecompressTgzCodec extends CompressionCodec {
  override def getDefaultExtension: String = ".tgz"
  // Compression is not supported; this codec is read-only.
  override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
  override def createCompressor(): Compressor = ???
  override def getCompressorType: Class[_ <: Compressor] = ???
  override def createInputStream(in: InputStream): CompressionInputStream = {
    new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
  }
  override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream = createInputStream(in)
  override def createDecompressor(): Decompressor = null
  override def getDecompressorType: Class[_ <: Decompressor] = null
  final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
    def updateStream(): Unit = {
      // current entry exhausted (or none opened yet) -> move on
      if (in.available() <= 0) {
        // create stream content from following tar elements one by one
        in.getNextTarEntry()
      }
    }
    override def read(): Int = {
      checkStream()
      updateStream()
      in.read()
    }
    override def read(b: Array[Byte], off: Int, len: Int): Int = {
      checkStream()
      updateStream()
      in.read(b, off, len)
    }
    override def resetState(): Unit = {}
  }
}
Then I registered it for use by Spark:
val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)
val spark = SparkSession
  .builder()
  .master("local[*]")
  .config(conf)
  .appName("Streaming Example")
  .getOrCreate()
This works exactly like I wanted it to.

I do not think reading tar.gz'ed files is possible in Spark out of the box (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended: gzip is not splittable, so each file ends up in a single partition (which in turn makes Spark of little to no help).
To have gzipped files loaded in Spark Structured Streaming, you have to specify a path pattern that includes them, say zsessionlog*.csv.gz or the like. Otherwise, a pattern ending in csv alone picks up plain CSV files only.
If you insist on having Spark Structured Streaming handle tar.gz'ed files, you could write a custom streaming data Source to do the un-tar.gz.
Given that gzip files are not recommended as a data format in Spark, the whole idea of using Spark Structured Streaming for them does not make much sense.
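For plain gzipped CSV (not tar archives), a minimal sketch of the path-pattern approach could look like the following. The two-column schema, the file name zsessionlog-1.csv.gz, the object name and the in-memory table name are assumptions made up for illustration; the question derives its schema from its own RawData case class.

```scala
import java.io.FileOutputStream
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import java.util.zip.GZIPOutputStream

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object GzipCsvStreamSketch {
  // Hypothetical schema, used only so the sketch is self-contained.
  val schema: StructType = StructType(Seq(
    StructField("id", StringType),
    StructField("message", StringType)))

  // Writes a small gzip-compressed, tab-separated file so there is input to read.
  def writeSample(dir: Path): Unit = {
    val out = new GZIPOutputStream(new FileOutputStream(dir.resolve("zsessionlog-1.csv.gz").toFile))
    try out.write("1\tfirst\n2\tsecond\n".getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }

  // Streams every file matching the glob into an in-memory table and returns the row count.
  def countRows(spark: SparkSession, dir: Path): Long = {
    val stream = spark.readStream
      .option("delimiter", "\t")
      .schema(schema)
      .csv(dir.toString + "/zsessionlog*.csv.gz") // the pattern includes the .gz files

    val query = stream.writeStream
      .format("memory")
      .queryName("gzip_rows")
      .outputMode("append")
      .start()
    try {
      query.processAllAvailable() // block until the files present so far are processed
      spark.table("gzip_rows").count()
    } finally query.stop()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("gzip-csv-stream").getOrCreate()
    val dir = Files.createTempDirectory("gzip-stream")
    writeSample(dir)
    println(s"rows read: ${countRows(spark, dir)}")
    spark.stop()
  }
}
```

The codec is picked from the .gz file extension, so no decompression option is set on the reader. This does not cover tar archives, which still need something like a custom codec or source.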

Related

Spark Structured Streaming program that reads from non-empty Kafka topic (starting from earliest) triggers batches locally, but not on EMR cluster

We are running into a problem where, for one of our applications,
we don't see any evidence of batches being processed in the Structured
Streaming tab of the Spark UI.
I have written a small program (below) to reproduce the issue.
A self-contained project that allows you to build the app, along with scripts that facilitate upload to AWS, and details on how to run and reproduce the issue can be found here: https://github.com/buildlackey/spark-struct-streaming-metrics-missing-on-aws (The GitHub version of the app is a slightly evolved version of what is presented below, but it illustrates the problem of Spark streaming metrics not showing up.)
The program can be run 'locally', on someone's laptop in local[*] mode (say with a dockerized Kafka instance),
or on an EMR cluster. For local mode operation you invoke the main method with 'localTest' as the first
argument.
In our case, when we run on the EMR cluster, pointing to a topic
where we know there are many data records (we read from 'earliest'), we
see that THERE ARE INDEED NO BATCHES PROCESSED on the cluster, for some reason.
In the local[*] case we CAN see batches processed.
To capture evidence of this I wrote a foreachBatch handler that simply does a
toLocalIterator.asScala.toList.mkString("\n") on the Dataset of each batch, and then dumps the
resulting string to a file. Running locally, I see evidence of the
captured records in the temporary file. HOWEVER, when I run on
the cluster and ssh into one of the executors, I see NO SUCH
files. I also checked the master node: no files matching the pattern 'Missing'.
So batches are not triggering on the cluster. Our Kafka has plenty of data, and
when running on the cluster the logs show we are churning through messages at increasing offsets:
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542913 requested 4596542913
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542914 requested 4596542914
Note that to get the logs we are using:
yarn logs --applicationId <appId>
which should get both driver and executor logs for the entire run (once the app terminates).
Now, in the local[*] case we CAN see batches processed. The evidence is a file whose name
matches the pattern 'Missing' in our tmp folder.
I am including my simple demo program below. If you can spot the issue and clue us in, I'd be very grateful!
// Please forgive the busy code; I stripped this down from a much larger system.
import com.typesafe.scalalogging.StrictLogging
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{Dataset, SparkSession}
import java.io.File
import java.util
import scala.collection.JavaConverters.asScalaIteratorConverter
import scala.concurrent.duration.Duration
object AwsSupportCaseFailsToYieldLogs extends StrictLogging {
  case class KafkaEvent(fooMsgKey: Array[Byte],
                        fooMsg: Array[Byte],
                        topic: String,
                        partition: String,
                        offset: String) extends Serializable
  case class SparkSessionConfig(appName: String, master: String) {
    def sessionBuilder(): SparkSession.Builder = {
      val builder = SparkSession.builder
      builder.master(master)
      builder
    }
  }
  case class KafkaConfig(kafkaBootstrapServers: String, kafkaTopic: String, kafkaStartingOffsets: String)
  def sessionFactory: (SparkSessionConfig) => SparkSession = {
    (sparkSessionConfig) => {
      sparkSessionConfig.sessionBuilder().getOrCreate()
    }
  }
  def main(args: Array[String]): Unit = {
    val (sparkSessionConfig, kafkaConfig) =
      if (args.length >= 1 && args(0) == "localTest") {
        getLocalTestConfiguration
      } else {
        getRunOnClusterConfiguration
      }
    val spark: SparkSession = sessionFactory(sparkSessionConfig)
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val dataSetOfKafkaEvent: Dataset[KafkaEvent] = spark.readStream.
      format("kafka").
      option("subscribe", kafkaConfig.kafkaTopic).
      option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers).
      option("startingOffsets", kafkaConfig.kafkaStartingOffsets).
      load.
      select(
        $"key" cast "binary",
        $"value" cast "binary",
        $"topic",
        $"partition" cast "string",
        $"offset" cast "string").map { row =>
      KafkaEvent(
        row.getAs[Array[Byte]](0),
        row.getAs[Array[Byte]](1),
        row.getAs[String](2),
        row.getAs[String](3),
        row.getAs[String](4))
    }
    val initDF = dataSetOfKafkaEvent.map { item: KafkaEvent => item.toString }
    val function: (Dataset[String], Long) => Unit =
      (dataSetOfString, batchId) => {
        val iter: util.Iterator[String] = dataSetOfString.toLocalIterator()
        val lines = iter.asScala.toList.mkString("\n")
        val outfile = writeStringToTmpFile(lines)
        println(s"writing to file: ${outfile.getAbsolutePath}")
        logger.error(s"writing to file: ${outfile.getAbsolutePath} / $lines")
      }
    val trigger = Trigger.ProcessingTime(Duration("1 second"))
    initDF.writeStream
      .foreachBatch(function)
      .trigger(trigger)
      .outputMode("append")
      .start
      .awaitTermination()
  }
  private def getLocalTestConfiguration: (SparkSessionConfig, KafkaConfig) = {
    val sparkSessionConfig: SparkSessionConfig =
      SparkSessionConfig(master = "local[*]", appName = "dummy2")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers = "localhost:9092",
        kafkaTopic = "test-topic",
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }
  private def getRunOnClusterConfiguration = {
    val sparkSessionConfig: SparkSessionConfig = SparkSessionConfig(master = "yarn", appName = "AwsSupportCase")
    val kafkaConfig: KafkaConfig =
      KafkaConfig(
        kafkaBootstrapServers = "kafka.foo.bar.broker:9092", // TODO - change this for kafka on your EMR cluster.
        kafkaTopic = "mongo.bongo.event", // TODO - change this for kafka on your EMR cluster.
        kafkaStartingOffsets = "earliest")
    (sparkSessionConfig, kafkaConfig)
  }
  def writeStringFile(string: String, file: File): File = {
    java.nio.file.Files.write(java.nio.file.Paths.get(file.getAbsolutePath), string.getBytes).toFile
  }
  def writeStringToTmpFile(string: String, deleteOnExit: Boolean = false): File = {
    val file: File = File.createTempFile("streamingConsoleMissing", "sad")
    if (deleteOnExit) {
      // schedule removal at JVM exit instead of deleting the file right away
      file.deleteOnExit()
    }
    writeStringFile(string, file)
  }
}
I have encountered a similar issue, and setting maxOffsetsPerTrigger fixed it. Actually, it is not really an issue.
All logs and metrics for a batch are only printed or shown after
that batch finishes. That is the reason why you can't see the job making
progress.
If maxOffsetsPerTrigger doesn't solve the issue, you could try consuming from the latest offset to confirm that the processing logic is correct.
This is a provisional answer. One of our team members has a theory that looks pretty likely. Here it is: batches ARE getting processed (this is demonstrated better by the version of the program I linked to on GitHub), but there is so much backed up in the topic on our cluster that processing the first batch (from earliest) takes a very long time. Hence, when looking at the cluster, we see zero batches processed even though there is clearly work being done. The solution might be to use maxOffsetsPerTrigger to limit the amount of incoming data per batch (when starting from earliest and working with a topic that has huge volumes of data). We are working on confirming this.
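A minimal sketch of what capping the batch size could look like, assuming the Kafka source from the question. maxOffsetsPerTrigger is an option of the Structured Streaming Kafka source. The object and helper names, and any limit value you pass in, are placeholders and not tuned recommendations.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BoundedKafkaRead {
  // Hypothetical helper: collects the Kafka source options in one place.
  def kafkaOptions(bootstrapServers: String, topic: String, maxOffsetsPerTrigger: Long): Map[String, String] =
    Map(
      "kafka.bootstrap.servers" -> bootstrapServers,
      "subscribe" -> topic,
      "startingOffsets" -> "earliest",
      // Upper bound on the number of offsets consumed per micro-batch,
      // so the first batch does not have to swallow the whole backlog.
      "maxOffsetsPerTrigger" -> maxOffsetsPerTrigger.toString)

  // Builds the streaming DataFrame; nothing is read until a query is started on it.
  def boundedStream(spark: SparkSession, options: Map[String, String]): DataFrame =
    spark.readStream.format("kafka").options(options).load()
}
```

With a bound in place, each micro-batch finishes sooner, so per-batch progress should show up in the UI and logs earlier, at the cost of more, smaller batches while catching up.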

Working with Protobuf-encoded MQTT streams in Apache Beam

I am trying to decode and process protobuf-encoded MQTT messages (from an Eclipse Mosquitto broker) using Apache Beam. In addition to the encoded fields, I also want to process the full topic of each message for grouping and aggregations, as well as the timestamp.
What I have tried so far
I can connect to Mosquitto via
val options = PipelineOptionsFactory.create()
val pipeline = Pipeline.create(options)
val mqttReader: MqttIO.Read = MqttIO
    .read()
    .withConnectionConfiguration(
        MqttIO.ConnectionConfiguration.create(
            "tcp://localhost:1884",
            "my/topic/+"
        )
    )
val readMessages = pipeline.apply<PCollection<ByteArray>>(mqttReader)
In order to decode the messages, I have compiled the .proto schema (in my case quote.proto containing the Quote message) via Gradle, which allows me to transform ByteArray into Quote objects via Quote.parseFrom():
val quotes = readMessages
    .apply(
        ParDo.of(object : DoFn<ByteArray, QuoteOuterClass.Quote>() {
            @ProcessElement
            fun processElement(context: ProcessContext) {
                val protoRow = context.element()
                context.output(QuoteOuterClass.Quote.parseFrom(protoRow))
            }
        })
    )
Using this, in the next apply, I can then access individual fields with a ProcessFunction and a lambda, e.g. { quote -> "${quote.volume}" }. However, there are two problems:
With this pipeline I do not have access to the topic or timestamp of each message.
After sending the decoded messages back to the broker with plain UTF8 encoding, I believe that they do not get decoded correctly.
Additional considerations
Apache Beam provides a ProtoCoder class, but I cannot figure out how to use it in conjunction with MqttIO. I suspect that the implementation has to look similar to
val coder = ProtoCoder
    .of(QuoteOuterClass.Quote::class.java)
    .withExtensionsFrom(QuoteOuterClass::class.java)
Instead of a PCollection<ByteArray>, the Kafka IO reader provides a PCollection<KafkaRecord<Long, String>>, which has all the relevant fields (including topic). I am wondering if something similar can be achieved with Mqtt + ProtoBuf.
A similar implementation to what I want to achieve can be done in Spark Structured Streaming + Apache Bahir as follows:
val df_mqttStream = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", topic)
  .load(brokerUrl)
val parsePayload = ProtoSQL.udf { bytes: Array[Byte] => Quote.parseFrom(bytes) }
val quotesDS = df_mqttStream.select("id", "topic", "payload")
  .withColumn("quote", parsePayload($"payload"))
  .select("id", "topic", "quote.*")
However, with Spark 2.4 (the latest supported version), accessing the message topic is broken (related issue, my ticket in Apache Jira).
From my understanding, the latest version of Apache Beam (2.27.0) simply does not offer a way to extract the specific topics of MQTT messages.
I have extended the MqttIO to return MqttMessage objects that include a topic (and a timestamp) in addition to the byte array payload. The changes currently exist as a pull request draft.
With these changes, the topic can simply be accessed as message.topic.
val readMessages = pipeline.apply<PCollection<MqttMessage>>(mqttReader)
val topicOfMessages: PCollection<String> = readMessages
    .apply(
        ParDo.of(object : DoFn<MqttMessage, String>() {
            @ProcessElement
            fun processElement(
                @Element message: MqttMessage,
                out: OutputReceiver<String>
            ) { out.output(message.topic) }
        })
    )

Spark kafka streaming - how to determine end of a batch

I am consuming from a Kafka topic using Kafka streaming (Kafka direct stream).
The data in this topic arrives every 5 minutes from another source.
I need to process the data that arrives every 5 minutes and convert it into a Spark DataFrame.
However, a stream is a continuous flow of data.
My issue is: how do I determine that I am done reading the first set of data that was loaded into the Kafka topic (so that I can convert it into a DataFrame and start my work)?
I know I can set the batch interval (in JavaStreamingContext) to a certain value, but even then I can never be sure how long the source will take to push the data to the topic.
Any suggestions are welcome.
If I understand your question correctly, you would like not to create a batch until all the data for the 5 minutes' worth of input has been read.
Spark does not provide any API like that out of the box.
You can, however, use a sliding window on your received stream to achieve part of what you want. (See last example code)
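As a rough illustration of the windowing idea, the sketch below collects everything received during each 5-minute window into one DataFrame. The queue-backed input is only a stand-in so the example is self-contained (a real job would pass in the Kafka direct stream), and the object name, batch interval and column name are assumptions.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Duration, Minutes, Seconds, StreamingContext}

object FiveMinuteWindowSketch {
  // Window length and slide are equal, so windows do not overlap.
  val WindowLength: Duration = Minutes(5)

  // Turns the records of each window into a DataFrame and reports its size.
  def processWindows(lines: DStream[String]): Unit =
    lines.window(WindowLength, WindowLength).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._
        val df = rdd.toDF("value")
        println(s"window held ${df.count()} records")
      }
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("five-minute-window")
    // The window length must be a multiple of the batch interval (here 30 seconds).
    val ssc = new StreamingContext(conf, Seconds(30))
    // Stand-in input for illustration only.
    val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq("a", "b", "c")))
    processWindows(ssc.queueStream(queue))
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Window boundaries follow Spark's clock, not the producer's schedule, so a delivery that straddles a boundary is still split across two windows. That is why this covers only part of the requirement.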
The other (harder) way is to drive the batches yourself through org.apache.spark.streaming.util.ManualClock.
ManualClock is a private class, so the wrapper has to be declared within Spark's own package namespace.
package org.apache.spark.streaming
import org.apache.spark.util.ManualClock
object ClockWrapper {
  def advance(ssc: StreamingContext, timeToAdd: Duration): Unit = {
    val manualClock = ssc.scheduler.clock.asInstanceOf[ManualClock]
    manualClock.advance(timeToAdd.milliseconds)
  }
}
Then, in your own class:
import org.apache.spark.streaming.{ClockWrapper, Duration, Seconds, StreamingContext}
// elided.
override def sparkConfig: Map[String, String] = {
  super.sparkConfig + ("spark.streaming.clock" -> "org.apache.spark.streaming.util.ManualClock")
}
def ssc: StreamingContext = _ssc
def advanceClock(timeToAdd: Duration): Unit = {
  // Only if some other conditions are met..
  ClockWrapper.advance(_ssc, timeToAdd)
}
def advanceClockOneBatch(): Unit = {
  advanceClock(Duration(batchDuration.milliseconds))
}
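State-based stream management can be done using the mapWithState API. In this sketch the types and someCoolOperation are placeholders for your own code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
object StatefulStreamOperation {
  // Placeholder types: the key marks the 5-minute batch a record belongs to.
  type BatchTime = Long
  case class UserClass(payload: String)
  // Placeholder for your own logic: returns the result to emit and the new state.
  def someCoolOperation(incoming: Option[UserClass], state: State[UserClass]): (UserClass, UserClass) = ???
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")
    // Placeholders: build these from your own input stream and initial state.
    val dataDstream: DStream[(BatchTime, UserClass)] = ???
    val initialRDD = ssc.sparkContext.emptyRDD[(BatchTime, UserClass)]
    val mappingFunc = (key: BatchTime, incoming: Option[UserClass], state: State[UserClass]) => {
      // Do whatever you need to do to the data.
      val (result, newState) = someCoolOperation(incoming, state)
      state.update(newState)
      result
    }
    val stateDstream = dataDstream.mapWithState(
      StateSpec.function(mappingFunc).initialState(initialRDD))
    // Do something with the result.
    ssc.start()
    ssc.awaitTermination()
  }
}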

How can I write a Dataset<Row> into kafka output topic on Spark Structured Streaming - Java8

I am trying to use the ForeachWriter interface in Spark 2.1, but I cannot get it to work.
A Kafka sink will be supported in Spark 2.2.0. To learn how to use it, I suggest you read this blog post: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
You can try Spark 2.2.0 RC2 [1] or just wait for the final release.
Another option, if you cannot use Spark 2.2.0+, is to take a look at this blog post:
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
It has a very simple Kafka sink, and maybe that's enough for you.
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC2-td21497.html
The first thing to know is that, if you are working with Spark Structured Streaming and processing streaming data, you will have a streaming Dataset.
That being said, the way to write this streaming Dataset is by using a ForeachWriter; you got that right.
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[Commons.UserEvent] {
  override def open(partitionId: Long, version: Long) = true
  override def process(value: Commons.UserEvent) = {
    processRow(value)
  }
  override def close(errorOrNull: Throwable) = {}
}
val query =
  ds.writeStream.queryName("aggregateStructuredStream").outputMode("complete").foreach(writer).start
And the function that writes into the topic would look like:
private def processRow(value: Commons.UserEvent) = {
  /*
   * Producer.send(topic, data)
   */
}
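One possible way to fill in the producer part is sketched below, assuming String values and the plain Kafka producer client on the classpath. The class name KafkaForeachWriter and its constructor parameters are made up for illustration. The producer is created in open() because it has to be built on the executor rather than serialized from the driver.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// Hypothetical writer: sends every row of the streaming Dataset[String] to one Kafka topic.
class KafkaForeachWriter(bootstrapServers: String, topic: String) extends ForeachWriter[String] {
  // Not serializable, so it is created per partition in open() instead of in the constructor.
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true // process this partition
  }

  override def process(value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close() // close() flushes pending records first
}
```

It would be plugged in with something like ds.writeStream.foreach(new KafkaForeachWriter("localhost:9092", "out-topic")).start(), where the broker address and topic name are placeholders.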

documentation about the file format of spark rdd.saveAsObjectFile

Spark can save a rdd to a file with rdd.saveAsObjectFile("file").
I need to read this file outside Spark. According to doc, using the default spark serializer, this file is just a sequence of objects serialized with the standard Java serialization. However, I guess the file has a header and a separator between objects. I need to read this file, and use jdeserialize to deserialize each Java/Scala object (as I don't have the class definition).
Where can I find the documentation about the file format produced by rdd.saveAsObjectFile("file") (with the standard serializer, not Kryo serializer)?
Update
Working example based on VladoDemcak's answer:
import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.conf._
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
def deserialize(data: Array[Byte]) =
  new ObjectInputStream(new ByteArrayInputStream(data)).readObject()
val path = new Path("/tmp/part-00000")
val config = new Configuration()
val reader = new SequenceFile.Reader(FileSystem.get(config), path, config)
val key = NullWritable.get
val value = new BytesWritable
while (reader.next(key, value)) {
  // copyBytes() returns exactly the valid bytes; getBytes may include buffer padding
  val bytes = value.copyBytes()
  println(s"key: $key and value: ${bytes.length} bytes")
  println(deserialize(bytes))
}
reader.close()
It is a very interesting question, so I will try to explain what I know about this stuff. You can check saveAsObjectFile; the only documentation I have seen with some details is the API javadoc:
/**
 * Save this RDD as a SequenceFile of serialized objects.
 */
def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}
So, as far as I know, saveAsObjectFile produces a SequenceFile. And based on the documentation for SequenceFile, it has a header with version, class names, metadata ...
There are 3 different SequenceFile formats:
1. Uncompressed key/value records.
2. Record compressed key/value records - only 'values' are compressed here.
3. Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
All of the above formats share a common header (which is used by the
SequenceFile.Reader to return the appropriate key/value pairs).
For reading a SequenceFile we can use Hadoop's SequenceFile.Reader implementation.
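Path path = new Path("/hdfs/file/path/seqfile");
Configuration config = new Configuration();
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
// ReflectionUtils also handles key classes without a public constructor, such as NullWritable
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), config);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), config);
while (reader.next(key, value)) {
    // for saveAsObjectFile output the value is a BytesWritable
    byte[] bytes = ((BytesWritable) value).copyBytes();
    logger.info("key: {} and value: {} bytes.", key, bytes.length);
    // (MyObject) deserialize(bytes);
}
reader.close();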
I have not tested this, but based on the doc link you mentioned in your question:
By default, Spark serializes objects using Java’s ObjectOutputStream
framework
so in the loop you can get the bytes for the value and deserialize them with ObjectInputStream:
public static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
        return in.readObject();
    }
}
In your case you need to use your library (jdeserialize) in the deserialize method; I guess run(InputStream is, boolean shouldConnect) etc.
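To make the record layout concrete, here is a small standard-library-only sketch of what each SequenceFile value holds according to the saveAsObjectFile source quoted above: a Java-serialized array of up to 10 elements. It uses String elements and made-up helper names, and it leaves out the SequenceFile container itself (header, NullWritable keys).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ObjectFileRecordSketch {
  // Plain Java serialization of one object.
  def serialize(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  def deserialize(bytes: Array[Byte]): AnyRef = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject() finally in.close()
  }

  // Mirrors iter.grouped(10).map(_.toArray): one serialized array per record value.
  def toRecordValues(elements: Seq[String]): List[Array[Byte]] =
    elements.grouped(10).map(group => serialize(group.toArray)).toList

  // Reading side: each value deserializes to an array that must be flattened again.
  def fromRecordValues(values: Seq[Array[Byte]]): Seq[String] =
    values.flatMap(bytes => deserialize(bytes).asInstanceOf[Array[String]].toSeq)

  def main(args: Array[String]): Unit = {
    val input = (1 to 25).map(i => s"row-$i")
    val values = toRecordValues(input)
    println(s"${input.size} elements became ${values.size} record values")
    println(fromRecordValues(values).take(3).mkString(", "))
  }
}
```

The practical consequence for a reader outside Spark is that a tool such as jdeserialize will see an array object per record, not one object per RDD element.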

Resources