Apache Spark Streaming maintain a binary search tree - apache-spark

I want to understand how to do the following thing:
I want to maintain a binary search tree(BST) with spark.
I have 2 simple operations, and I get them in streaming.
That's why I thought about using Spark Streaming.
The operations are the following:
a)add a number to the BST
b)delete a number
*Lets assume that I don't have duplicate numbers.
How can I do it the right way? My main problem is that I'm not sure where should I keep the tree.(Supposing it's size always fits in my RAM)
For me, the "Big Data" here is the number of operations, so i want to use Spark Streaming in order to handle with lots of operations which come in streams.
Again, the tree is being kept small, and will always fit in the RAM. (What if it doesn't?)
What would be the best approach?
In addition to that, I would like to do the same things using Stack data structure instead of BST.
The operations are only push and pop numbers.
Maybe Apache Storm will be better for those tasks?

For the stack you can use redis with key as counter or timestamp when pushing, when pooping pop the latest.

For the BST YOU could use graph x, and use it as a distributed data structure.
Another approach could be with Akka, http://alexminnaar.com/building-a-distributed-binary-search-tree-with-akka.htm.
For the stack you may use, pair rdd with d stream, key being the the time stamp while pushing, but not sure how to pop.

Related

Spark with divide and conquer

I'm learning Spark and trying to process some huge dataset. I don't understand why I don't see decrease in stage completion times with following strategy (pseudo):
data = sc.textFile(dataset).cache()
while True:
data.count()
y = data.map(...).reduce(...)
data = data.filter(lambda x: x < y).persist()
So idea is to pick y so that it most of the time ~halves the data. But for some reason it looks like all the data is always processed again on each count().
Is this some kind of an anti-pattern? How I'm supposed to do this with Spark?
Yes, that is an anti-pattern.
map, same as most, but not all, of the distributed primitives in Spark, is pretty much by definition a divide and conquer approach. You take the data, you compute splits, and transparently distribute computing of individual splits over the cluster.
Trying to further divide this process, using high level API, makes no sense at all. At best it will provide no benefits at all, at worst it will incur the cost of multiple data scans, caching and spills.
Spark is lazily evaluated so in the for or while loop above each call to data.filter does not sequentially return the data but instead sequentially returns Spark calls to be executed later. All these calls get aggregated and then executed simultaneously when you do something later.
In particular, results remain unevaluated and merely represented until a Spark Action gets called. Past a certain point the application can’t handle that many parallel tasks.
In a way we’re running into a conflict between two different representations: conventional structured coding with its implicit (or at least implied) execution patterns and independent, distributed, lazily-evaluated Spark representations.

Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect if code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible it would relieve application developers of the requirement to write nondeterministic code, which might sometimes be challenging and in any case this requirement can easily be forgotten.
No. Spark will not be able to handle non deterministic code in case of failures. The fundamental data structure of Spark, RDD is not only immutable but it
should also be determinstic function of it's input. This is necessary otherwise Spark framework will not be able to recompute the partial RDD (partition) in case of
failure. If the recomputed partition is not deterministic then it had to re-run the transformation again on full RDDs in lineage. I don't think that Spark is a right
framework for non-deterministic code.
If Spark has to be used for such use case, application developer has to take care of keeping the output consistent by writing code carefully. It can be done by using RDD only (no datframe or dataset) and persisting output after every transformation executing non-determinstic code. If performance is the concern, then the intermediate RDDs can be persisted on Alluxio.
A long term approach would be to open a feature request in apache spark jira. But I am not too positive about the acceptance of feature. A little hint in syntax to know wether code is deterministic or not and framework can switch to recover RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in spark 2.4.1, which I'm using).
I have encountered issues with this a few times on spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time that function is run. If a spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.

Order Guarantee with Sparking Streaming

I am trying to get some change event from Kafka that I would like to propagate downstream in another system. However the Change order matters. Hence I wonder what is the appropriate way to do that with some Spark transformation in the middle.
The only thing I see is to loose the parallelism and make the DStream on one partition. Maybe there is a way to do operation in parallel and bring everything back in one partition and then send it to the external system or back in Kafka and then use a Kafka Sink for the matter.
What approach can I try?
In a distributed environment, with some form of cashing/buffering at most layer, message generated from same machine may reach back-end in different order. Also the definition of order is subjective. Implementing a global definition of order will be restrictive (may not be correct) for the data as a whole.
So, Kafka is meant for keeping the data in order in the order of put but partition comes as a catch!!! Partition defines the level of parallelism per topic.
Typically, the level of abstraction at which kafka is kept, it should not bother much about order. It should be optimised for maximum throughput, where partitioning will come handy!!! Consider ordering just a side effect of supporting streaming!!!
Now, what ever logic ensures, that data is put in to kafka in order, that makes more sense in your application (spark job).

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to spark streaming and I need some basic clarification that I couldn't fully understand reading the documentation.
The use case is that I have a set of files containing dumping EVENTS, and each events has already inside a field TIMESTAMP.
At the moment I'm loading this file and extracting all the events in a JavaRDD and I would like to pass them to Spark Streaming in order to collect some stats based on the TIMESTAMP (a sort of replay).
My question is if it is possible to process these event using the EVENT TIMESTAMP as temporal reference instead of the actual time of the machine (sorry for the silly question).
In case it is possible, will I need simply spark streaming or I need to switch to Structured Streaming?
I found a similar question here:
Aggregate data based on timestamp in JavaDStream of spark streaming
Thanks in advance
TL;DR
yes you could use either Spark Streaming or Structured Streaming, but I wouldn't if I were you.
Detailed answer
Sorry, no simple answer to this one. Spark Streaming might be better for the per-event processing if you need to individually examine each event. Structured Streaming will be a nicer way to perform aggregations and any processing where per-event work isn't necessary.
However, there is a whole bunch of complexity in your requirements, how much of the complexity you address depends on the cost of inaccuracy in the Streaming job output.
Spark Streaming makes no guarantee that events will be processed in any kind of order. To impose ordering, you will need to setup a window in which to do your processing that minimises the risk of out-of-order processing to an acceptable level. You will need to use a big enough window of data to accurately capture your temporal ordering.
You'll need to give these points some thought:
If a batch fails and is retried, how will that affect your counters?
If events arrive late, will you ignore them, re-process the whole affected window, or update the output? If the latter how can you guarantee the update is done safely?
Will you minimise risk of corruption by keeping hold of a large window of events, or accept any inaccuracies that may arise from a smaller window?
Will the partitioning of events cause complexity in the order that they are processed?
My opinion is that, unless you have relaxed constraints over accuracy, Spark is not the right tool for the job.
I hope that helps in some way.
It is easy to do aggregations based on event-time with Spark SQL (in either batch or structured streaming). You just need to group by a time window over your timestamp column. For example, the following will bucket you data into 1 minute intervals and give you the count for each bucket.
df.groupBy(window($"timestamp", "1 minute") as 'time)
.count()

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing maximum amount of my work transforming and performing actions on RDDs than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.

Resources