spark streaming with aggregation - apache-spark

I am trying to understand Spark Streaming in terms of aggregation principles.
Spark streaming DataFrames are based on mini-batches, and computations are done on the mini-batch that arrived within a specific time window.
Let's say we have data coming in as:
Window_period_1[Data1, Data2, Data3]
Window_period_2[Data4, Data5, Data6]
..
Then the first computation is done for Window_period_1 and then for Window_period_2. If I need to use the new incoming data along with historic data, say a kind of groupBy between Window_period_new and the data from Window_period_1 and Window_period_2, how would I do that?
Another way of seeing the same thing: let's say I have a requirement where a few DataFrames are already created -
df1, df2, df3 - and I need to run an aggregation which will involve data from
df1, df2, df3 as well as Window_period_1, Window_period_2, and all new incoming streaming data.
How would I do that?

Spark allows you to store state in an RDD (with checkpoints). So, even after a restart, the job will restore its state from the checkpoint and continue streaming.
However, we ran into performance problems with checkpoints (especially after restoring state), so it is worth implementing state storage in some external store (like HBase).
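A minimal sketch of that checkpointed-state approach on the DStream API; the socket source, port, checkpoint path and key/value parsing are only placeholders (in Structured Streaming the analogous tool is mapGroupsWithState):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(new SparkConf().setAppName("running-aggregation").setMaster("local[2]"), Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoint") // state survives restarts via this checkpoint dir
// "key value" lines arriving on a socket; any receiver (Kafka, Flume, ...) works the same way
val pairs = ssc.socketTextStream("localhost", 9999)
  .map(_.split("\\s+"))
  .map(parts => (parts(0), parts(1).toLong))
// merge each new mini-batch into the per-key historic state
val runningTotals = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  Some(newValues.sum + state.getOrElse(0L))
}
runningTotals.print()
ssc.start()
ssc.awaitTermination()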

Related

How does spark structured streaming job handle stream - static DataFrame join?

I have a Spark Structured Streaming job which reads a mapping table from Cassandra and Delta Lake and joins it with the streaming df. I would like to understand the exact mechanism here. Does Spark hit these data sources (Cassandra and Delta Lake) for every micro-batch cycle? If that is the case, why do I see in the Spark web UI that these tables are read only once?
Please help me understand this.
Thanks in advance
"Does spark hit these data sources(cassandra and deltalake) for every cycle of microbatch?"
According to the book "Learning Spark, 2nd Edition" from O'Reilly, in the section on stream-static joins it is mentioned that the static DataFrame is read in every micro-batch.
To be more precise, I find the following section of the book quite helpful:
Stream-static joins are stateless operations, and therefore do not require any kind of watermarking.
The static DataFrame is read repeatedly while joining with the streaming data of every micro-batch, so you can cache the static DataFrame to speed up the reads.
If the underlying data in the data source on which the static DataFrame was defined changes, whether those changes are seen by the streaming query depends on the specific behavior of the data source. For example, if the static DataFrame was defined on files, then changes to those files (e.g. appends) will not be picked up until the streaming query is restarted.
When applying a "stream-static" join it is assumed that the static side is not changing at all or is only changing slowly. If you plan to join two rapidly changing data sources, you need to switch to a "stream-stream" join.

Spark dataframe creation through already distributed in-memory data sets

I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark DataFrame takes only a fraction of time (in ms) for the 'Sorting' itself, but moving the data is much more expensive (> 14 sec).
Explanation:
I have a huge collection of Arrow RecordBatches which is equally distributed across all of my worker nodes' memory (in plasma_store). Currently, I am collecting all those RecordBatches back on my master node, merging them, and converting them into a single Spark DataFrame. Then I apply the sorting function to that DataFrame.
A Spark DataFrame is a cluster-distributed data collection.
So my question is:
Is it possible to create a Spark DataFrame from all those Arrow RecordBatches that are already distributed across the worker nodes' memory, so that the data remains in the respective worker nodes' memory (instead of bringing it all to the master node, merging it, and then creating a distributed DataFrame)?
Thanks!
Yes, you can store the data in the Spark cache; whenever you try to get the data, it will be served from the cache rather than from the source.
Please use the links below to understand more about caching:
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Where is df.cache() stored?
https://unraveldata.com/to-cache-or-not-to-cache/
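For illustration, a minimal caching sketch (the parquet path and the "timestamp" column are made up):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("cache-example").getOrCreate()
val df = spark.read.parquet("/data/events") // hypothetical path
df.cache()                  // lazily marks the DataFrame for caching
df.count()                  // the first action materializes the cached blocks on the executors
df.sort("timestamp").show() // later actions read from the cache instead of the parquet files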

Spark Structured Streaming foreachBatch and UPSERT (merge): to persist or not to persist?

In the case of (arbitrary) stateful aggregation in Structured Streaming with foreachBatch to merge updates into a Delta table, should I persist the batch DataFrame inside foreachBatch before upserting, or not?
It seems to me that persist is not required since I'm writing to a single data sink.
On the other hand, I have a strong feeling that not persisting will cause a source re-scan and trigger the aggregation twice.
Any comments/thoughts?
foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, batchId) ->
    deltaTable.as("table")
        .merge(batchDf.as("updates"), functions.expr("table.id = updates.id"))
        .whenNotMatched().insertAll() // new session to be added
        .whenMatched().updateAll()
        .execute())
So the answer from delta-users (https://groups.google.com/g/delta-users/c/Ihm6PMilCdI) is:
DeltaTable.merge (upsert) does two passes over the source data.
So if you DO care about the Spark metrics or logs of the arbitrary stateful aggregation inside mapGroupsWithState/flatMapGroupsWithState, do persist/cache the batch DataFrame before the merge inside foreachBatch; otherwise the emitted metrics will be doubled (x2) and the aggregation logs will be emitted twice.
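A hedged sketch of that pattern in Scala; the Delta table path, checkpoint location, join condition, and the rate-source stand-in for the real stateful aggregation are all placeholders:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
val spark = SparkSession.builder().appName("upsert-stream").getOrCreate()
val deltaTable = DeltaTable.forPath(spark, "/delta/sessions") // hypothetical existing table
// stand-in for the real mapGroupsWithState output: a streaming DataFrame with an "id" column
val aggregated = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .selectExpr("CAST(value AS STRING) AS id", "timestamp")
def upsert(batchDf: DataFrame, batchId: Long): Unit = {
  batchDf.persist() // merge does two passes over the source; persist avoids recomputing the aggregation
  deltaTable.as("table")
    .merge(batchDf.as("updates"), "table.id = updates.id")
    .whenNotMatched().insertAll()
    .whenMatched().updateAll()
    .execute()
  batchDf.unpersist()
}
aggregated.writeStream
  .foreachBatch(upsert _)
  .option("checkpointLocation", "/delta/sessions/_checkpoint")
  .start()
  .awaitTermination()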
Let me cite the page below:
To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it.
I don't know if you have already visited this page, but it seems that you are correct that persist is not necessary in your case. It is essential when writing to multiple locations.
Source: https://docs.databricks.com/spark/latest/structured-streaming/foreach.html
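For completeness, the multiple-locations case from that page could look roughly like this (the sink formats and paths are placeholders):
import org.apache.spark.sql.DataFrame
// foreachBatch body that writes the same micro-batch to two locations
def writeToTwoSinks(batchDf: DataFrame, batchId: Long): Unit = {
  batchDf.persist()                                                   // compute the batch only once
  batchDf.write.format("delta").mode("append").save("/delta/sink-a")  // hypothetical location 1
  batchDf.write.format("parquet").mode("append").save("/data/sink-b") // hypothetical location 2
  batchDf.unpersist()                                                 // release the cached blocks
}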

How does Apache Spark SQL lineage evolve as we run SQL updates on a DataFrame?

I am trying to develop a backend module, which requires me to run several SQL updates on a DataFrame backed by a parquet format in HDFS. What I am interested in knowing is how multiple SQL updates affect the DataFrame's RDD lineage, and also whether it would be a concern to execute several frequent SQL updates on a DataFrame, since according to my understanding a single SQL update on a DataFrame is a transformation. Is there anything equivalent to doing a batch update to a DataFrame in a single lineage step?
Two important notes:
Spark DataFrames are immutable and as such cannot be updated. You can only create a new DataFrame.
Transformations and lineage are RDD-specific. While internally every set of operations on a DataFrame (Dataset) is translated to some DAG and executed using RDDs, there is no trivial correspondence between RDD stages and the methods you apply on a Dataset. Operators can be transparently rearranged, removed, or squashed together. How exactly a query is transformed is not part of the contract, and if you're interested in the details for a given version you should check the execution plan as well as the DAG of the corresponding RDD.
In general you can expect that individual operations require between zero stages (if a particular operation is eliminated by projections or the use of trivial predicates) and two stages (typical aggregations). Projections are typically scheduled together if possible; aggregation behavior has changed over time.
Finally, some operations may require more than one job to infer a schema or compute statistics.
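As a small illustration of how such chained "updates" show up (or collapse) in the plan rather than as one RDD transformation per call, assuming a parquet source with an "amount" column (both made up):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()
val base = spark.read.parquet("/data/events") // hypothetical path with an "amount" column
// each "update" only produces a new immutable DataFrame on top of the previous one
val step1 = base.withColumn("amount", col("amount") * 1.1)
val step2 = step1.withColumn("amount", col("amount") + 5)
val step3 = step2.filter(col("amount") > 0)
// Catalyst is free to collapse the chained projections; compare the parsed and optimized plans
step3.explain(true)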

How to update an RDD?

We are developing a Spark framework wherein we are moving historical data into RDD sets.
Basically, an RDD is an immutable, read-only dataset on which we do operations.
Based on that, we have moved historical data into RDDs, and we do computations like filtering/mapping, etc. on those RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of an RDD.
I create another RDD based on the request scope and save the reference to that RDD in a ScopeCollection.
So far I have been able to think of the approaches below -
Approach 1: broadcast the change
1. For each change request, my server fetches the scope-specific RDD and spawns a job.
2. In the job, apply a map phase on that RDD:
2.a. For each element in the RDD, do a lookup on the broadcast variable and create a new value which is now updated, thereby creating a new RDD.
2.b. Now do all the computations again on this new RDD from step 2.a, like multiplication, reduction, etc.
2.c. Save this RDD's reference back in my ScopeCollection.
Approach 2: create an RDD for the updates
1. For each change request, my server fetches the scope-specific RDD and spawns a job.
2. On each RDD, do a join with the new RDD containing the changes.
3. Now do all the computations again on this new RDD from step 2, like multiplication, reduction, etc.
Approach 3:
I had thought of creating a streaming RDD where I keep updating the same RDD and doing re-computation. But as far as I understand, it can take streams from Flume or Kafka, whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration point for streaming RDDs in my context.
Any suggestion on which approach is better, or any other approach suitable for this scenario?
TIA!
The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data, and to preserve that data in RDD form. Kafka and Flume are only two possible stream sources.
You could use socket communication with SocketInputDStream, read files from a directory using FileInputDStream, or even use a shared queue with QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
In this use case, using Spark Streaming, you would read your base RDD and use the incoming DStream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform allows you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation can help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application it is hard to go down to the code level on what's possible using Spark Streaming. I'd suggest you explore this path and ask new questions for any specific topics.
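Still, a rough sketch of the general shape, using a queue-backed DStream since the updates come from the application itself; the key/value types, checkpoint path and merge logic are only placeholders:
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(new SparkConf().setAppName("rdd-updates").setMaster("local[2]"), Seconds(5))
ssc.checkpoint("/tmp/rdd-updates-checkpoint") // required by updateStateByKey
// the application pushes update RDDs into this queue as users interact
val updateQueue = new mutable.Queue[RDD[(String, Double)]]()
val updates = ssc.queueStream(updateQueue)
// the historical base RDD, keyed the same way as the updates
val historical: RDD[(String, Double)] = ssc.sparkContext.parallelize(Seq("a" -> 1.0, "b" -> 2.0))
// combine each batch of updates with the base RDD ...
val joined = updates.transform(batch => historical.leftOuterJoin(batch))
// ... and keep an evolving per-key state: latest update wins, else previous state, else the historical value
val state = joined.updateStateByKey[Double] { (values: Seq[(Double, Option[Double])], old: Option[Double]) =>
  val latestUpdate = values.flatMap(_._2).lastOption
  latestUpdate.orElse(old).orElse(values.headOption.map(_._1))
}
state.print()
ssc.start()
// elsewhere in the application: updateQueue += ssc.sparkContext.parallelize(Seq("a" -> 10.0))
ssc.awaitTermination()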
I suggest taking a look at the IndexedRDD implementation, which provides an updatable RDD of key-value pairs. That might give you some insights.
The idea is based on knowledge of the key, which allows you to zip your updated chunk of data with the same keys of the already created RDD. During an update it is possible to filter out the previous version of the data.
Having historical data, I'd say you have to have some sort of identity for an event.
Regarding streaming and consumption, it is possible to use a TCP port. That way the driver can open a TCP connection that Spark expects to read from and send the updates there.
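Without pulling in IndexedRDD, the key-based merge idea can be sketched with plain pair-RDD operations (the keys and values below are made up):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("keyed-update").master("local[*]").getOrCreate()
val sc = spark.sparkContext
// existing data and an updated chunk, both keyed by an event identity
val existing = sc.parallelize(Seq(("e1", 1.0), ("e2", 2.0), ("e3", 3.0)))
val updated  = sc.parallelize(Seq(("e2", 20.0)))
// keep the new value where an update exists, otherwise the previous version
val merged = existing.leftOuterJoin(updated).mapValues {
  case (oldValue, maybeNew) => maybeNew.getOrElse(oldValue)
}
merged.collect().foreach(println) // (e1,1.0), (e3,3.0), (e2,20.0) in some order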
