What is the best way to store incoming streaming data? - apache-spark

What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?

In my experience we have used Hbase as datastore for spark streaming data(we also has same scenario many writes and few reads), since we are using hadoop, hbase has native integration with hadoop and it went well..
Above we have used tostore hight rate of messages coming over from solace.
HBase is well suited for doing Range based scans. Casandra is known for availablity and many other things...
However, I can also observe one general trend in many projects, they are simply storing rawdata in hdfs (parquet + avro) in partitioned structure through spark streaming with spark dataframe(SaveMode.Append) and they are processing rawdata with Spark
Ex of partitioned structure in hdfs :
completion ofbusinessdate/environment/businesssubtype/message type etc....
in this case there is no need for going to Hbase or any other data store.
But one common issue in above approach is when you are getting small and tiny files, through streaming then you would need to repartion(1) or colelese or FileUtils.copymerge to meet block size requirements to single partitioned file. Apart from that above approach also would be fine.
Here is some thing called CAP theorm based on which decision can be taken.
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it
succeeded or failed).
Partition tolerance (the system continues to
operate despite arbitrary partitioning due to network failures)
Casandra supports AP.
Hbase supports CP.
Look at detailed analysis given here

Related

Best deduplication strategy to be used with spark

What is the best de-duplication strategy to be used with spark?
I have a Kafka source that is continuously fed with structured information (say JSON) from various producers continuously.
I am having an HDInsight spark cluster that can pick messages in real time for this Kafka source, process them and put it into a destination Kafka source in real time.
My use case demands that the information received from the source may have duplicates which need to be eliminated. The duplicates have to be be checked against say last 24 hours.
My attempt :
I tried using the .dropduplicate method in spark along with watermarking , but I think it's not the best thing to do since the data for a single day window may exceed 50 GB in my use case.
I also looked for bloom filter implementation which can be used with spark but couldn't find a good one.
My question:
What are the possible approaches to eliminate duplication in general for large scale spark streaming application.?
Which of these features can be used along with HDInsight clusters on Azure ?
What are the fault tolerance capability in such services ?

Why there is no JDBC Spark Streaming receiver?

I suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing Receiver doesn't look difficult.
Could you describe why such receiver doesn't exist (is this approach a bad idea?) or provide links to implementations.
I've found Stratio/datasource-receiver. But it reads all data in a DataFrame before processing by Spark Streaming.
Thanks!
First of all actual streaming source would require a reliable mechanism for monitoring updates, which is simply not a part of JDBC interface nor it is a standardized (if at all) feature of major RDBMs, not to mention other platforms, which can be accessed through JDBC. It means that streaming from a source like this typically requires using log replication or similar facilities and is highly resource dependent.
At the same what you describe
suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once
is really not an use case for streaming. Streaming deals with infinite streams of data, while you ask is simply as scenario for partitioning and such capabilities are already a part of the standard JDBC connector (either by range or by predicate).
Additionally receiver based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and wouldn't be even less appealing if data was bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
I don't think it is a bad idea since in some cases you have constraints that are outside your power,e.g. legacy systems to which you cannot apply strategies such as CDC but to which you still have to consume as a source of stream data.
On the other hand, Spark Structure Streaming engine, in micro-batch mode, requires the definition of an offset than can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDMDS are not the "streaming-friendly" as far as I know.
I have developed Jdbc2s which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need. Coordinates are in the documentation.

Spark Direct Stream Kafka order of events

I have a question regarding reading data with Spark Direct Streaming (Spark 1.6) from Kafka 0.9 saving in HBase.
I am trying to do updates on specific row-keys in an HBase table as recieved from Kafka and I need to ensure the order of events is kept (data received at t0 is saved in HBase for sure before data received at t1 ).
The row key, represents an UUID which is also the key of the message in Kafka, so at Kafka level, I am sure that the events corresponding to a specific UUID are ordered at partition level.
My problem begins when I start reading using Spark.
Using the direct stream approach, each executor will read from one partition. I am not doing any shuffling of data (just parse and save), so my events won't get messed up among the RDD, but I am worried that when the executor reads the partition, it won't maintain the order so I will end up with incorrect data in HBase when I save them.
How can I ensure that the order is kept at executor level, especially if I use multiple cores in one executor (which from my understanding result in multiple threads)?
I think I can also live with 1 core if this fixes the issue and by turning off speculative execution, enabling spark back pressure optimizations and keeping the maximum retries on executor to 1.
I have also thought about implementing a sort on the events at spark partition level using the Kafka offset.
Any advice?
Thanks a lot in advance!

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
Master Node in Spark is for allocating the resources to a particular job and once the resources are allocated, the Driver ships the complete code with all its dependencies to the various executors.
The first step in every code is to load the data to the Spark cluster. You can read the data from any underlying data repository like Database, filesystem, webservices etc.
Once data is loaded it is wrapped into an RDD which is partitioned across the nodes in the cluster and further stored in the workers/ Executors Memory. Though you can control the number of partitions by leveraging various RDD API's but you should do it only when you have valid reasons to do so.
Now all operations are performed over RDD's using its various methods/ Operations exposed by RDD API. RDD keep tracks of partitions and partitioned data and depending upon the need or request it automatically query the appropriate partition.
In nutshell, you do not have to worry about the way data is partitioned by RDD or which partition stores which data and how they communicate with each other but if you do care, then you can write your own custom partitioner, instructing Spark of how to partition your data.
Secondly if your data cannot be partitioned then I do not think Spark would be an ideal choice because that will result in processing of everything in 1 single machine which itself is contrary to the idea of distributed computing.
Not sure what is exactly your use case but there are people who have been leveraging Spark for Image processing. see here for the comments from Databricks

Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are creating a real-time stream processing system with spark streaming which uses large number (millions) of analytic models applied to RDDs in the many different type of incoming metric data streams(more then 100000). This streams are original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which spark cluster node will process which specific RDDs from different streams, we need to make ALL these models available at each Spark compute node. This will create huge overhead at each spark node. We are considering using in-memory data grids to provide these models at spark compute nodes. Is this the right approach?
Or
Should we avoid using Spark streaming all together and just use in-memory data grids like Redis(with pub/sub) to solve this problem. In that case we will stream data to specific Redis nodes which contain the specific models. of course we will have to do all binning/window etc..
Please suggest.
Sounds like to me like you need a combination of stream processing engine and a distributed data store. I would design the system like this.
The distributed datastore (Redis, Cassandra, etc.) can have the data you want to access from all the nodes.
Receive the data streams through a combination data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process it in the stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
In the functions that is used to process the stream records, the necessary data will have to pulled from the data store and maybe cached locally as appropriate.
You may also have to update the data store from spark streaming as application needs it. In which case you will also have to worry about versioning of the data that you want pull in step 3.
Hopefully that made sense. Its hard to give any more specifics of the implementation without the exactly computation model. Hope this helps!

Resources