How to read streaming datasets from socket? - apache-spark

Below code reads from a socket, but I don't see any input going into the job. I have nc -l 1111 running and am dumping data into it, but I'm not sure why my Spark job is not able to read data from 10.176.110.112:1111.
Dataset<Row> d = sparkSession.readStream()
    .format("socket")
    .option("host", "10.176.110.112")
    .option("port", 1111)
    .load();

Below code reads from a socket, but I don't see any input going into the job.
Well, honestly, you do not read anything from anywhere. You've only described what you are going to do when you start the streaming pipeline.
Since you use Structured Streaming to read datasets from a socket, you should use the start operator to trigger data fetching (and only after you have defined the sink).
start(): StreamingQuery Starts the execution of the streaming query, which will continually output results to the given path as new data arrives. The returned StreamingQuery object can be used to interact with the stream.
Before start you should define where to stream your data. It could be Kafka, files, a custom streaming sink (perhaps using the foreach operator), or the console.
I use the console sink (aka format) in the following example. I also use Scala and leave rewriting it to Java as your home exercise (a rough sketch follows the Scala snippet).
import org.apache.spark.sql.streaming.Trigger

d.writeStream.          // <-- this is the most important part
  trigger(Trigger.ProcessingTime("10 seconds")).
  format("console").
  option("truncate", false).
  start                 // <-- and this
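For reference, a rough Java equivalent of the same pipeline could look like the sketch below; the SparkSession setup is added only so the sketch stands on its own, and the host and port come from the question.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class SocketToConsole {
  public static void main(String[] args) throws Exception {
    SparkSession sparkSession = SparkSession.builder().appName("socket-to-console").getOrCreate();

    Dataset<Row> d = sparkSession.readStream()
        .format("socket")
        .option("host", "10.176.110.112")
        .option("port", 1111)
        .load();

    StreamingQuery query = d.writeStream()           // <-- this is the most important part
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .format("console")
        .option("truncate", false)
        .start();                                    // <-- and this

    query.awaitTermination();                        // keep the driver alive while the query runs
  }
}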

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from Kafka stream using microbatching (readStream), processing them and writing results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing microbatches of size 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc., but possibly a different processing algorithm), Spark finishes the previously running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How do I tell Spark that the second job is different from the first one, so there is no need to restore from the checkpoint (i.e. the intended behaviour is to create a completely new streaming query not connected to the previous one, and keep the previous one running)?
You can achieve independence of the two streaming queries by setting the checkpointLocation option in their respective writeStream calls. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere with each other.
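For illustration, a minimal Java sketch of two independent queries over the same topic follows; the broker address, topic names, checkpoint directories and the two selectExpr expressions are placeholders standing in for the real processing logic.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class IndependentQueries {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("independent-queries").getOrCreate();

    // Both queries read the same topic; broker and topic names are placeholders.
    Dataset<Row> source = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "input-topic")
        .load();

    // Query A: its own sink topic and, crucially, its own checkpoint directory.
    StreamingQuery queryA = source
        .selectExpr("CAST(value AS STRING) AS value")           // stand-in for processing algorithm A
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "output-topic-a")
        .option("checkpointLocation", "/checkpoints/query-a")   // per-query checkpoint
        .start();

    // Query B: same source, different logic, different checkpoint directory,
    // so Spark treats it as a completely separate streaming query.
    StreamingQuery queryB = source
        .selectExpr("UPPER(CAST(value AS STRING)) AS value")    // stand-in for processing algorithm B
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "output-topic-b")
        .option("checkpointLocation", "/checkpoints/query-b")   // per-query checkpoint
        .start();

    spark.streams().awaitAnyTermination();
  }
}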

Spark structured streaming window when no stream

I want to log the number of records read from an incoming stream to a database in Spark Structured Streaming. I'm using foreachBatch to transform each incoming micro-batch and write it to the desired location. I want to log 0 records read if there are no records in a particular hour, but foreachBatch does not execute when there is no stream. Can anyone help me with it? My code is below:
val incomingStream = spark.readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()

val query = incomingStream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    writeStreamToDataLake(batchDF, batchId, partitionColumn,
      fileLocation, errorFilePath, eventHubName, configMeta)
  }
  .option("checkpointLocation", fileLocation + checkpointFolder + "/" + eventHubName)
  .trigger(Trigger.ProcessingTime(triggerTime.toLong))
  .start()
  .awaitTermination()
This is how it works by design: even modifications or extensions via a StreamingQueryListener are invoked only when there is something to process, and thus when the status of the stream changes.
There probably is another way, but I would say "think outside of the box": pre-populate such a database with a 0 per timeframe, and when querying, aggregate; you will then have the correct answer.
https://medium.com/@johankok/structured-streaming-in-a-flash-576cdb17bbee can give some insight, plus Spark: The Definitive Guide.
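As a rough sketch of that workaround (the table name stream_metrics, the columns event_hour and record_count, and the paths are made up for illustration): the streaming job appends one row per processed micro-batch, a separate job pre-populates a zero row for every hour, and the reporting query simply sums per hour.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ZeroFilledRecordCounts {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("zero-filled-record-counts").getOrCreate();

    // Assumption: foreachBatch appends one (event_hour, record_count) row per
    // processed micro-batch, and a scheduled job inserts a (event_hour, 0) row
    // for every hour up front. Summing at query time then yields 0 for hours
    // with no incoming data, even though foreachBatch never ran for them.
    Dataset<Row> hourlyCounts = spark.sql(
        "SELECT event_hour, SUM(record_count) AS records_read " +
        "FROM stream_metrics " +
        "GROUP BY event_hour " +
        "ORDER BY event_hour");

    hourlyCounts.show(false);
  }
}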

How does Structured Streaming ensure exactly-once writing semantics for file sinks?

I am writing a storage writer for Spark Structured Streaming which will partition the given dataframe and write to a different blob store account. The Spark documentation says that it ensures exactly-once semantics for file sinks, but also says that exactly-once semantics are only possible if the source is re-playable and the sink is idempotent.
Is the blob store an idempotent sink if I write in parquet format?
Also, how will the behavior change if I am doing streamingDF.writeStream.foreachBatch(...writing the DF here...).start()? Will it still guarantee exactly-once semantics?
Possible duplicate: How to get Kafka offsets for structured query for manual and reliable offset management?
Update #1: Something like -
output
  .writeStream
  .foreachBatch((df: DataFrame, _: Long) => {
    // storagePaths and r are defined elsewhere in the job
    val path = storagePaths(r.nextInt(3))
    df.persist()
    df.write.parquet(path)
    df.unpersist()
  })
Micro-Batch Stream Processing
I assume that the question is about Micro-Batch Stream Processing (not Continuous Stream Processing).
Exactly once semantics are guaranteed based on available and committed offsets internal registries (for the current stream execution, aka runId) as well as regular checkpoints (to persist processing state across restarts).
exactly once semantics are only possible if the source is re-playable and the sink is idempotent.
It is possible that whatever has already been processed but not recorded properly internally (see below) can be re-processed:
That means that all streaming sources in a streaming query should be re-playable to allow for polling for data that has once been requested.
That also means that the sink should be idempotent, because data that has been processed successfully and added to the sink may be added again if a failure happened just before Structured Streaming managed to record the data (offsets) as successfully processed (in the checkpoint).
Internals
Before the available data (by offset) of any of the streaming sources or readers is processed, MicroBatchExecution commits the offsets to the Write-Ahead Log (WAL) and prints out the following INFO message to the logs:
Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]
A streaming query (a micro-batch) is executed only when there is new data available (based on offsets) or the last execution requires another micro-batch for state management.
In addBatch phase, MicroBatchExecution requests the one and only Sink or StreamWriteSupport to process the available data.
Once a micro-batch finishes successfully, MicroBatchExecution commits the available offsets to the commits checkpoint, and the offsets are then considered processed.
MicroBatchExecution prints out the following DEBUG message to the logs:
Completed batch [currentBatchId]
When you use foreachBatch, Spark only guarantees that foreachBatch is normally called once per micro-batch. If an exception is thrown during the execution of foreachBatch, Spark will call it again for the same batch. In that case we can get duplicates if we store to multiple storages and the exception happens partway through storing.
So you can manually handle exceptions during storing to avoid duplication.
In my practice, I created a custom sink when I needed to store to multiple storages, using the Data Source API V2, which supports commit.
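As a rough illustration of one way to tolerate such retries (not necessarily what the answer above implemented), each store's output can be keyed by batchId so a re-run of the same micro-batch overwrites its own earlier, possibly partial, output instead of appending a duplicate; the rate source and the paths below are placeholders.

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MultiStoreForeachBatch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("multi-store-foreach-batch").getOrCreate();

    // Placeholder source; in the question this is the actual streaming DataFrame.
    Dataset<Row> streamingDF = spark.readStream().format("rate").load();

    // Placeholder blob store destinations.
    String[] storagePaths = { "/stores/a", "/stores/b", "/stores/c" };

    VoidFunction2<Dataset<Row>, Long> writeToAllStores = (batchDF, batchId) -> {
      batchDF.persist();
      for (String basePath : storagePaths) {
        // Writing under a batch_id=<id> directory in Overwrite mode means a retry
        // of the same micro-batch replaces its earlier (possibly partial) output
        // rather than adding a second copy.
        batchDF.write()
            .mode(SaveMode.Overwrite)
            .parquet(basePath + "/batch_id=" + batchId);
      }
      batchDF.unpersist();
    };

    streamingDF.writeStream()
        .foreachBatch(writeToAllStores)
        .option("checkpointLocation", "/checkpoints/multi-store")
        .start()
        .awaitTermination();
  }
}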

Parquet File Output Sink - Spark Structured Streaming

Wondering what triggers a Spark Structured Streaming query (with a Parquet file output sink configured) to write data to the parquet files, and how to modify that. I periodically feed the stream input data (using StreamReader to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.
I am wondering how to control this. I would like to be able force a new write to Parquet file for every new file provided as input. Any tips appreciated!
Note: I have maxFilesPerTrigger set to 1 on the read stream call. I am also seeing the streaming query process the single input file; however, a single file on input does not appear to result in the streaming query writing the output to the Parquet file.
After further analysis, and working with the ForEach output sink using the default Append mode, I believe the issue I was running into was the combination of the Append mode along with the watermarking feature.
After re-reading https://spark.apache.org/docs/2.2.1/structured-streaming-programming-guide.html#starting-streaming-queries it appears that when the Append mode is used with a watermark set, Spark Structured Streaming will not write out aggregation results to the Result table until the watermark time limit has passed. Append mode does not allow updates to records, so it must wait for the watermark to pass, to ensure there is no change to the row...
I believe the Parquet file sink does not allow the Update mode; however, after switching to the ForEach output sink and using the Update mode, I observed data coming out of the sink as I expected. Essentially, for each record in, at least one record out, without the delay observed before.
Hopefully this is helpful to others.
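As a rough sketch of the behavior described above (the schema, paths and column names are made up): with a watermark and a windowed aggregation in Append mode, a window's result row is only written to the Parquet sink once the watermark passes the end of that window, which is why output lags behind the individual input files.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class WatermarkedParquetSink {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("watermarked-parquet-sink").getOrCreate();

    // Hypothetical schema for the incoming JSON files.
    StructType inputSchema = new StructType()
        .add("eventTime", "timestamp")
        .add("value", "string");

    Dataset<Row> files = spark.readStream()
        .schema(inputSchema)
        .option("maxFilesPerTrigger", 1)     // process one input file per trigger
        .json("/data/incoming");

    // Windowed count with a watermark: in Append mode a window is emitted only
    // after the watermark moves past the window's end, so results are delayed
    // relative to the arrival of each input file.
    Dataset<Row> counts = files
        .withWatermark("eventTime", "10 minutes")
        .groupBy(window(col("eventTime"), "5 minutes"))
        .count();

    counts.writeStream()
        .format("parquet")
        .outputMode("append")
        .option("path", "/data/out")
        .option("checkpointLocation", "/data/out/_checkpoints")
        .start()
        .awaitTermination();
  }
}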

Use Spark to Write Kafka Messages Directly to a File

For a class project, I need a Spark Java program to listen as a Kafka consumer and write all of a Kafka topic's received messages to a file (e.g. "/user/zaydh/my_text_file.txt").
I am able to receive the messages in as a JavaPairReceiverInputDStream object; I can also convert it to a JavaDStream<String> (this is from the Spark Kafka example).
However, I could not find a good Java syntax to write this data to what is essentially a single log file. I tried using foreachRDD on the JavaDStream object, but I could not find a clean, parallel-safe way to sink it to a single log file.
I understand this approach is non-traditional or non-ideal, but it is a requirement. Any guidance is much appreciated.
When you think of a stream, you have to think of it as something that won't stop giving out data.
Hence, if Spark Streaming had a way to save all the incoming RDDs to a single file, that file would keep growing to a huge size (and the stream isn't supposed to stop, remember? :))
But in this case you can make use of the saveAsTextFile utility of an RDD,
which will create many files in your output directory depending on the batch interval specified while creating the streaming context: JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1))
You can then merge these file parts into one using something like how-to-merge-all-text-files-in-a-directory-into-one.
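A minimal Java sketch of that approach, assuming the receiver-based KafkaUtils.createStream from the older spark-streaming-kafka 0.8 artifact (which matches the JavaPairReceiverInputDStream mentioned in the question); the ZooKeeper address, group id, topic and output path are placeholders.

import java.util.Collections;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaToTextFiles {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("kafka-to-text-files");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Receiver-based Kafka stream, as in the Spark Kafka example the question refers to.
    // ZooKeeper quorum, group id and topic name are placeholders.
    Map<String, Integer> topics = Collections.singletonMap("my_topic", 1);
    JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(jssc, "zookeeper:2181", "my-group", topics);

    JavaDStream<String> lines = messages.map(record -> record._2());

    // Each non-empty batch is written to its own directory of part files;
    // these can later be merged (e.g. with hadoop fs -getmerge) into one file.
    lines.foreachRDD((rdd, time) -> {
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile("/user/zaydh/my_text_file/batch-" + time.milliseconds());
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}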

Resources