How to write Spark Streaming calculated results to HDFS?

I am writing a Spark Streaming job and my batch window is 1 minute. At regular intervals of 30 minutes I want to write something to HDFS.
Can I do that in Spark Streaming?
If yes, how?
I don't want to write in each Spark Streaming batch, as that would create too many files on HDFS.
I am getting an input stream, adding only records I have not seen earlier to an RDD (or DataFrame), and then at the end of each 30-minute interval I want to write that to HDFS.
The current solutions in my mind are:
Use updateStateByKey
Use Checkpoint with huge interval
Just wondering what the standard pattern is in such use cases.
Thanks,
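One common pattern here (a minimal PySpark sketch, not a drop-in solution; the input/output paths, app name and the dedup step are assumptions): keep the 1-minute batch interval, but put a 30-minute window() over the DStream so the write inside foreachRDD fires only once every 30 minutes.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ThirtyMinuteWriter")     # hypothetical app name
ssc = StreamingContext(sc, 60)                      # 1-minute batch interval
lines = ssc.textFileStream("hdfs:///input/dir")     # placeholder input source

# 30-minute window with a 30-minute slide: downstream operators see the data
# once every 30 minutes instead of once per 1-minute batch
windowed = lines.window(30 * 60, 30 * 60)

def write_window(time, rdd):
    if not rdd.isEmpty():
        # distinct() deduplicates within this 30-minute window only; deduplication
        # across the whole stream would still need something like updateStateByKey
        rdd.distinct().saveAsTextFile("hdfs:///output/" + time.strftime("%Y%m%d%H%M"))

windowed.foreachRDD(write_window)
ssc.start()
ssc.awaitTermination()

This writes one output directory per half hour instead of one per 1-minute batch; checkpointing is only required once you add stateful transformations such as updateStateByKey.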

Related

Spark Streaming: Many queued batches after a long time running without problems

We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps the DStream into a Dataset, and writes these Datasets to Parquet files (inside DStream.foreachRDD) at the end of every batch.
At the beginning everything seems fine: Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The amount of produced Kafka messages is a bit less than the amount of messages we consume in our Spark application, so no backpressure is needed (in the beginning). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
Everything runs just fine for hours, but after around 12-14 hours our processing time increases rapidly, e.g. it jumps from the normal 10 seconds to more than 1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.
We saw similar results for 5-minute batches (processing time is around 1.5 minutes there and suddenly increases to more than 10 minutes per batch after a period of time).
Similar results also happened when we wrote ORC instead of Parquet files.
Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
We're using the Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.
Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet/ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.
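For context, the write path described above has roughly this shape (a PySpark sketch; the real job apparently uses Scala Datasets, and the file-based source below is just a placeholder for the Kafka direct stream, with all names and paths made up):

from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("KafkaToParquet").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 30)             # 30-second batch interval, as in the question
messages = ssc.textFileStream("hdfs:///staging/incoming")  # placeholder for the Kafka direct stream

def write_batch(rdd):
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda v: Row(value=v)))
        # x partitions => x Parquet files appended to the warehouse directory per batch
        df.write.mode("append").parquet("hdfs:///warehouse/events")

messages.foreachRDD(write_batch)
ssc.start()
ssc.awaitTermination()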

Processing Batch files coming every 2 minutes in system using spark Streaming

I am trying to process my near-real-time batch CSV files using Spark Streaming. I read these files in batches of 100 files, do some operations, and write to output files. I am using the spark.readStream and spark.writeStream functions to read and write streaming files.
I am trying to find out how I can stop the Spark streaming query.
stream_df = spark.readStream.option("maxFilesPerTrigger", 10).csv(filepath_directory)
query = (stream_df.writeStream
         .format("parquet")
         .outputMode("append")
         .option("path", "output_filepath")   # placeholder output directory
         .start())
While the streaming job is running, it fails or throws exceptions for some reason.
I tried try/except.
I am planning to stop the query in case of any exception and process the same code again.
I used query.stop(); is this the right way to stop a streaming job?
I read one post, but I am not sure whether it applies only to DStreams and how to use this approach in my PySpark code:
https://www.linkedin.com/pulse/how-shutdown-spark-streaming-job-gracefully-lan-jiang
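query.stop() is the documented way to stop a Structured Streaming query. A minimal sketch of the try/except shape around it, assuming the query variable from the corrected snippet above has already been started:

try:
    query.awaitTermination()                 # blocks until the query stops or fails
except Exception as e:                       # awaitTermination() raises if the query fails
    print("Streaming query failed: %s" % e)
    if query.isActive:
        query.stop()                         # stop the query; the SparkSession stays alive
    # ...re-create the reader/writer and call start() again here to process the same code again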

How can I achieve streaming data aggregation per batch using Spark Structured Streaming?

I am using Spark Structured Streaming to read a bunch of files arriving in a specific folder on my system.
I want to run a streaming aggregation query on the data and write the result to Parquet files every batch, using Append mode. This way, Spark Structured Streaming performs a partial, intra-batch aggregation that is written to disk, and we read the output Parquet files with an Impala table that points to the output directory.
So I need to have something like this:
batch aggregated_value
batch-1 10
batch-2 8
batch-3 17
batch-4 13
I actually don't need the batch column but it helps to clarify what I am trying to do.
Does Structured Streaming offer a way to achieve this?
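One way to get exactly this shape of output (a sketch under assumptions, Spark 2.4+): foreachBatch hands you each micro-batch as a regular DataFrame, so you can aggregate just that batch and append one row per batch to Parquet. The input format, schema, column names and paths below are placeholders.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("PerBatchAggregation").getOrCreate()

schema = StructType([StructField("key", StringType()), StructField("value", LongType())])
stream_df = spark.readStream.schema(schema).csv("/incoming/folder")      # assumed CSV input

def write_batch_aggregate(batch_df, batch_id):
    # aggregate only the rows of this micro-batch and append a single result row
    (batch_df.agg(F.sum("value").alias("aggregated_value"))
             .withColumn("batch", F.lit(batch_id))
             .write.mode("append").parquet("/output/aggregates"))

query = (stream_df.writeStream
         .foreachBatch(write_batch_aggregate)
         .option("checkpointLocation", "/output/checkpoint")
         .start())
query.awaitTermination()

The Impala table can then point at /output/aggregates. Note that a plain streaming aggregation in Append mode (without foreachBatch) needs a watermark and only emits a group once the watermark has passed it.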

Alternate to recursively Running Spark-submit jobs

Below is the scenario I need suggestions on.
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is a continuous stream, I would like the Spark jobs to run every 1 or 2 minutes on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every 1 min using a scheduler?
How do we reduce the overhead and time lag in submitting the job repeatedly to the Spark cluster? Is there a better way to run a single program repeatedly?
Run a spark streaming job?
Can a Spark Streaming job get triggered automatically every 1 minute and process the data from Hive? [Can Spark Streaming be triggered only time-based?]
Is there any other efficient mechanism to handle such scenario?
Thanks in Advance
If you need something that runs every minute, you had better use Spark Streaming and not batch.
You may want to get the data directly from Kafka and not from a Hive table, since it is faster.
As for your question of whether batch or streaming is better: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this : https://spark.apache.org/docs/latest/streaming-programming-guide.html
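A rough sketch of what the streaming option can look like, here using Structured Streaming with a 1-minute trigger reading directly from Kafka (broker, topic, paths and the ETL step are placeholders; the DStream API described in the linked guide is the other option):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EtlEveryMinute").getOrCreate()

events = (spark.readStream
          .format("kafka")                                    # requires the spark-sql-kafka package
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "ingest_topic")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

# ...apply the ETL / join logic here instead of re-submitting a batch job every minute...

query = (events.writeStream
         .format("parquet")
         .option("path", "/warehouse/etl_output")
         .option("checkpointLocation", "/warehouse/etl_checkpoint")
         .trigger(processingTime="1 minute")                  # micro-batch every minute
         .start())
query.awaitTermination()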

Controlling Spark Streaming of the Files

I am using Spark to read text files from a folder and load them into Hive.
The interval for Spark Streaming is 1 minute. In rare cases the source folder may have 1000 large files.
How do I control Spark Streaming to limit the number of files the program reads? Currently my program reads all files generated in the last 1 minute, but I want to control the number of files it reads.
I am using the textFileStream API.
JavaDStream<String> lines = jssc.textFileStream("C:/Users/abcd/files/");
Is there any way to control the file streaming rate?
I am afraid not.
Spark Streaming is time-driven.
You can use Flink, which provides a data-driven model:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html#windows
You could use "spark.streaming.backpressure.enabled" and "spark.streaming.backpressure.initialRate" for controlling the rate at which data is received.
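A minimal sketch of setting those two properties when building the streaming context (values are illustrative; note that these settings rate-limit receiver- and Kafka-based input streams):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("RateLimitedStream")                             # hypothetical app name
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "1000"))    # records per second to start with

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 60)   # 1-minute batch interval, as in the question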
If your files are CSV files, you can use structured streaming to read the files into a streaming DataFrame with maxFilesPerTrigger like this:
import org.apache.spark.sql.types._

val streamDf = spark.readStream
  .option("maxFilesPerTrigger", "10")
  .schema(StructType(Seq(StructField("some_field", StringType))))
  .csv("/directory/of/files")
