I know Spark Streaming produces batches of RDDs, but I'd like to accumulate one big DataFrame that is updated with each batch (by appending each new DataFrame to the end).
Is there a way to access all historical Stream data like this?
I've seen mapWithState() but I haven't seen it accumulate Dataframes specifically.
While DataFrames are implemented as batches of RDDs under the hood, a DataFrame is presented to the application as a non-discrete, infinite stream of rows. There are no "batches of DataFrames" in the way that there are "batches of RDDs".
It's not clear what historical data you would like.
I understand that one RDD is generated per batchInterval when there is a continuous stream of data. How is the number of RDDs decided for one DStream?
It was my assumption that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case, and "Difference between DataFrame, Dataset, and RDD in Spark" does a good job of explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First, they are the internal data structure for tracking changes between stages in order to manage failures. Second, until Spark 1.3 they were the main interface for interaction with users. Since Spark 1.3, DataFrames have been the main interface, offering much richer functionality than RDDs.
There is no significant overhead when converting a DataFrame to an RDD with df.rdd: DataFrames already keep an instance of their RDD initialized, so returning a reference to that RDD should not have any additional cost. On the other hand, generating a DataFrame from an RDD requires some extra effort. There are two ways to convert an RDD to a DataFrame: first, by calling rdd.toDF(), and second, with spark.createDataFrame(rdd, schema). Both methods evaluate lazily, although there is extra overhead for schema validation and for building the execution plan (see the toDF() source for more details). Of course, you pay that same overhead when you initialize your data with spark.read.text(...), but with one step fewer: the conversion from RDD to DataFrame.
This is the first reason that I would go directly with DataFrames instead of working with two different Spark interfaces.
The second reason is that when you use the RDD interface you miss some significant performance features that DataFrames and Datasets offer, related to the Spark optimizer (Catalyst) and memory management (Tungsten).
Finally, I would use the RDD interface only if I needed features that are missing in DataFrames, such as key-value pairs, the zipWithIndex function, etc. Even then, you can access those via df.rdd, which is costless as already mentioned. As for your case, I believe it would be faster to use a DataFrame directly and apply the map function of that DataFrame, so that Spark can use Tungsten for efficient memory management.
The definition of DStream from the documentation states,
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
The question is: if it is represented as a series of RDDs, can we make a Stream of RDDs and expect it to work similarly to a DStream?
It would be great if someone can help me to understand this with a code sample.
The question is: if it is represented as a series of RDDs, can we make a Stream of RDDs and expect it to work similarly to a DStream?
You're right. A DStream is logically a series of RDDs.
Spark Streaming simply hides the process of creating the Seq[RDD], so that it is the framework's job rather than yours.
Moreover, Spark Streaming gives you a much nicer developer API: you can think of a DStream as a Seq[RDD], but rather than rdds.map(rdd => your code goes here) you simply write dstream.map(t => your code goes here). The two are not that different apart from the types of rdd and t; with a DStream you are simply already working one level below.
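To make the "one level below" point concrete, here is a minimal sketch in plain Python. It is not Spark code: each batch (the stand-in for one RDD) is an ordinary list, and map_batches is a hypothetical helper that plays the role of dstream.map. It only illustrates that mapping over the records of a DStream is the same as mapping over every record of every RDD in the series.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_batches(batches: Iterable[List[T]], f: Callable[[T], U]) -> Iterator[List[U]]:
    """Hypothetical stand-in for dstream.map(f): apply f to every record of every batch."""
    for batch in batches:
        yield [f(record) for record in batch]


# Three made-up micro-batches, one per batch interval (the middle one is empty).
batches = [["a", "b"], [], ["c"]]

# Working on the series of batches yourself: you handle each batch explicitly.
per_batch = [[record.upper() for record in batch] for batch in batches]

# Working "one level below": you only supply the per-record function.
per_record = list(map_batches(batches, str.upper))

print(per_batch == per_record)
```

In real Spark Streaming the framework also takes care of producing the batches from the source on every batch interval, which is the part this sketch leaves out.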
In Spark, I loaded a data set as an RDD and would like to infrequently append streaming data to it. I know RDDs are immutable because it simplifies locking, etc. Are there other approaches to processing static and streaming data together as one?
A similar question has been asked before:
Spark : How to append to cached rdd?
Have a look at http://spark.apache.org/streaming/.
With Spark Streaming, you get a data structure representing a collection of RDDs that you can iterate over. It can listen to a Kafka queue, a file system, etc. to find new data to include in the next RDD.
Or if you only do these "appends" rarely, you can union two RDDs with the same schema to get a new combined RDD.
I am trying to understand spark streaming in terms of aggregation principles.
Spark DataFrames are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.
Let's say we have data coming in as:
Window_period_1[Data1, Data2, Data3]
Window_period_2[Data4, Data5, Data6]
..
then the first computation will be done for Window_period_1 and the next for Window_period_2. If I need to use the new incoming data along with historic data, say in a kind of groupby between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
Another way of seeing the same thing: let's say I have a requirement where a few data frames are already created -
df1, df2, df3 and I need to run an aggregation which will involve data from
df1, df2, df3 and Window_period_1, Window_period_2, and all new incoming streaming data
how would I do that?
Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
However, we faced performance problems with checkpoints (especially after restoring state), so it is worth storing the state in some external source (like HBase).
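The idea of carrying state from one window to the next can be sketched in plain Python. This is only an illustration, not Spark code: update_count follows the shape of the update function that updateStateByKey expects (new values for a key plus the previous state), while process_batch, the sample records and the external_store dictionary (standing in for something like HBase) are all hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def update_count(new_values: List[int], previous: Optional[int]) -> int:
    """Fold the new values for one key into the state kept from earlier batches."""
    return (previous or 0) + sum(new_values)


def process_batch(state: Dict[str, int], batch: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Group one micro-batch by key, then merge it into the running state."""
    grouped: Dict[str, List[int]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = dict(state)  # keys absent from this batch keep their old state
    for key, values in grouped.items():
        new_state[key] = update_count(values, state.get(key))
    return new_state


# Hypothetical stand-in for an external store such as HBase.
external_store: Dict[str, int] = {}

# Made-up (key, value) records for two consecutive windows.
window_period_1 = [("x", 1), ("y", 2), ("x", 3)]
window_period_2 = [("y", 4), ("z", 5)]

for batch in (window_period_1, window_period_2):
    external_store = process_batch(external_store, batch)

print(external_store)  # {'x': 4, 'y': 6, 'z': 5}
```

Pre-existing data such as df1, df2, df3 could be handled the same way by loading its aggregates into the store before the first window arrives, so that every later batch is merged into totals that already include the historic data.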