How many batches / RDDs are present in one DStream? - apache-spark

I understand that one RDD is generated per batchInterval when there is a continuous stream of data. How is the number of RDDs in one DStream decided?

Related

Why does Spark create fewer partitions than the number of files when reading from S3

I'm using Spark 2.3.1.
I have a job that reads 5,000 small parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Why?
Spark is grouping multiple files into each partition due to their small size. You should see as much when you print out the partitions.
Example (Scala):
val df = spark.read.parquet("/path/to/files")
df.rdd.partitions.foreach(println)
If you want to use 5,000 tasks you could apply a repartition transformation.
Quote from the docs about repartition:
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data
over the network.
I recommend you take a look at the RDD Programming Guide. Remember that shuffle is an expensive operation.
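A minimal sketch of that, reusing the df from the example above (5000 here is just the number of input files in this case):
val repartitioned = df.repartition(5000)  // full shuffle into exactly 5000 partitions
repartitioned.rdd.getNumPartitions        // 5000, so downstream stages run 5000 tasks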

How can I accumulate Dataframes in Spark Streaming?

I know Spark Streaming produces batches of RDDs, but I'd like to accumulate one big Dataframe that updates with each batch (by appending new dataframe to the end).
Is there a way to access all historical Stream data like this?
I've seen mapWithState() but I haven't seen it accumulate Dataframes specifically.
While a streaming Dataframe is implemented as batches of RDDs under the hood, it is presented to the application as a non-discrete, unbounded stream of rows. There are no "batches of Dataframes" the way there are "batches of RDDs".
It's not clear what historical data you would like.
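As a hedged illustration of that model (assuming Structured Streaming with a socket source; the host, port and column names are just those of the standard word-count example), a streaming aggregation in "complete" output mode keeps one accumulated result over everything seen so far:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("accumulate-sketch").getOrCreate()
import spark.implicits._

// A streaming DataFrame: one unbounded table of rows, not discrete batches
val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

// The aggregation is maintained over the whole stream; "complete" output
// mode re-emits the full accumulated result on every trigger
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()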

Spark transform method behaviour with multiple partitions

I am using Kafka Streaming to read data from a Kafka topic, and I want to join every RDD that I get in the stream to an existing RDD. So I think using "transform" is the best option (unless anyone disagrees and suggests a better approach).
I read the following example of the "transform" method on DStreams in Spark:
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information
val cleanedDStream = wordCounts.transform { rdd =>
rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
...
}
But let's say I have 3 partitions in the Kafka topic, and that I invoke 3 consumers to read from those. Now this transform method will be called in three separate threads in parallel.
I am not sure whether joining the RDDs in this case will be thread-safe and whether it might result in data loss (considering that RDDs are immutable).
Also, if it is thread-safe, wouldn't the performance be low since we are creating so many RDDs and then joining them?
Can anybody advise?

Apache Spark Streaming: How to compare 2 dataframes in 2 DStreams

I am a beginner with Apache Spark. I am trying to run a streaming job which receives some data, converts it into a dataframe and runs some processing like joining and removing duplicates, etc. Now I have to cache this processed data so that I can append it to the next DStream (using some union/join) and do the processing again.
I tried using dataframe.cache() to cache and reuse it in the next stream batch.
For example, if df is the DataFrame formed from the current RDD of the DStream:
dstream.foreachRDD { rdd =>
  val combined = df.unionAll(processed)      // append this batch to the previously processed data
  combined.registerTempTable("TableScheme")
  sqlContext.sql("...")                      // perform inner join and some other processing
  processed = combined
  processed.cache()
}
When we perform Dataframe.cache() or Dataframe.persist(), are we caching the actual data or the DAG / transformations applied? When the second stream batch arrives, my program exits with
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
org.apache.spark.SparkException: Attempted to use BlockRDD[1] at socketTextStream after its blocks have been removed!
The cache() call marks the result of the DAG up to that point for storage; the data is computed and kept when the first action runs, so that the computations are not repeated should more than one action be performed.
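A minimal sketch of that behaviour (df and other are hypothetical DataFrames with id and amount columns):
val processed = df.join(other, "id").filter("amount > 0")  // nothing is computed yet
processed.cache()   // lazily marks the result for storage
processed.count()   // first action runs the DAG and materializes the cache
processed.show()    // second action reuses the cached data instead of recomputing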
What you want is some persistence across stream batches.
There are several ways to do that:
streamingContext.remember(Minutes(5)) holds on to data from previous batches for the given amount of time (here, five minutes)
windowing moves a fixed time window across the data, allowing you to perform operations on more than one batch of data
updateStateByKey & mapWithState provide mechanisms to maintain and transform state across batches (a sketch follows below)
The approach you choose will depend largely on your use-case.
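For instance, a minimal mapWithState sketch (wordDstream is assumed to be a DStream[(String, Int)] of per-batch word counts built earlier) that keeps a running total per key across batches:
import org.apache.spark.streaming.{State, StateSpec}

// Fold each new value into the running state kept for its key
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

// Requires checkpointing to be enabled, e.g. ssc.checkpoint("/some/dir")
val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc))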

Spark RDD append

In Spark, I loaded a data set as an RDD and would like to infrequently append streaming data to it. I know RDDs are immutable because it simplifies locking, etc. Are there other approaches to processing static and streaming data together as one?
Similar question has been asked before:
Spark : How to append to cached rdd?
Have a look at http://spark.apache.org/streaming/.
With Spark Streaming, you get a data structure representing a collection of RDDs that you can iterate over. It can listen to a Kafka queue, a file system, etc. to find new data to include in the next RDD.
Or if you only do these "appends" rarely, you can union two RDDs with the same schema to get a new combined RDD.
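A minimal sketch of that union approach (the paths are placeholders, sc is the SparkContext):
val staticRdd = sc.textFile("/data/static")     // data set loaded once
val newRdd = sc.textFile("/data/new-arrivals")  // infrequent new data with the same structure
val combinedRdd = staticRdd.union(newRdd)       // records lineage only, no shuffle happens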
