What is the best way to get Dataframe Abstraction over HBase Data without Pheonix - apache-spark

I want to save and read the data from HBase from/to Spark.
I want to get the Dataframe abstraction as dataframe is best for memory management compared to RDD and it is convenient to do any processing.
I looked at possible candidates for getting Dataframe abstraction. One of them is Phoenix based solution. I do not want to have pheonix layer on top of HBase due to approvals. I searched for other solutions, but would want to know the best possibility that someone had tried.

We have a performant one at Splice Machine (Open Source). We wrote a separate InputFormat for HBase so we can read directly from store files in hbase vs. performing remote scans. The killer for Spark performance on top of hbase is the remote scan based InputFormat (i.e. how you read the data).
Sean Busbey at Cloudera has worked on a Spark HBase connector and here is a blog from HortonWorks on a similar idea...
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
The "connectors" functionally work but perform poorly for large data sets.
Hope this helps and good luck.

Related

Differences in Execution betwen Hive and Spark

All: I am looking for someone with more knowledge to check my understanding of Hive and Spark
I have been researching different large scale database solutions and I am trying to understand the difference in execution between Hive and Spark. I attempted to install Hadoop, Hive, and Spark to see how they perform. I was able to get Hadoop and Spark to work. I was unable to get Hive to work.
When I ran queries in Spark after they passed through the optimizer, it seems that the biggest advantage is that only the relevant table data is selected from the source at the earliest inception. So if I only needed Table1.columns(A,B,C) in the final answer, but told the system to JOIN Table1 & Table2 on (Table1.A=Table2.B) it immediately reduces the carried table to only the relevant items...I do not think Hive performs that way. I believe it will do the full join and perform the reduction later.
There are also differences in the memory storage (Hive going back the the HDFS frequently, vs Spark keeping things in RAM). This has both advantages and disadvantages depending on the data set/query.
Unfortunately because I cannot get Hive to run, my theory is based off of reading outputs of other people running things in Hive.
I Think hive and spark originally have different goals, and their execution styles are based on those goals.
Apache spark is a framework that allows you to do calculations on big datasets. stored on hdfs
Hive is an SQL interface to retriev data stored in an hdfs, and other clusterized and object store filesystems (S3 is an example) in a structured way.
Spark keeps things on ram because its more focused on making calculations with the data sets. Hive is more focused on retrieving data in a structured way, so it does not focus on speed that much (that being said, there have been improvements in hive, like llap that are meant to improve performance).
I like to use analogies with traditional software tools. On one side, you can have a relational database, and on the other side, a programming language. They both overlap in some functionality (you can write and read to disk with the programming language, and you can do some calculations with the sql engine. However, if the task at hand requires intensive and complex calculations you would probably use the programming language. If you are looking for a system that lets you store data in a structured way, you would go for the sql engine.
Hive on Tez and Spark both use Ram(memory) for operating on data . The number of partitions computed which will be treated as individual tasks would be quite different from Hive on Tez vs Spark . Hive on Tez by default tries to use combiner to merge certain splits into single partition . Hive one Tez seem to handle autoscaling of clusters in a better way than spark and does work most of the time.Spark doesn't work with autoscaling it would have lot of shuffle errors and will fail when there are multiple stages . But given a fixed size of cluster Spark seems to perform better over Hive on TEZ this could be attributed to some of the optimizations done and also how the shuffle ,serialization etc are implemented .

Can Apache Spark be used in place of Sqoop

I have tried connecting spark with JDBC connections to fetch data from MySQL / Teradata or similar RDBMS and was able analyse the data.
Can spark be used to store the data to HDFS?
Is there any possibility for spark outperforming
the activities of Sqoop.
Looking for you valuable answers and explanations.
There are two main things about Sqoop and Spark. The main difference is Sqoop will read the data from your RDMS doesn't matter what you have and you don't need to worry much about how you table is configured.
With Spark using JDBC connection is a little bit different how you need to load the data. If your database doesn't have any column like numeric ID or timestamp Spark will load ALL the data in one single partition. And then will try to process and save. If you have one column to use as partition than Spark sometimes can be even faster than Sqoop.
I would recommend you to take a look in this doc.enter link description here
The conclusion is, if you are going to do a simple export and that need to be done daily with no transformation I would recommend Sqoop to be simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for that, besides that goes with Sqoop

Any benefit for my case when using Hive as datawarehouse?

Currently, i am trying to adopt big data to replace my current data analysis platform. My current platform is pretty simple, my system get a lot of structured csv feed files from various upstream systems, then, we load them as java objects (i.e. in memory) for aggregation.
I am looking for using Spark to replace my java object layer for aggregation process.
I understandthat Spark support loading file from hdfs / filesystem. So, Hive as data warehouse seems not a must. However, i can still load my csv files to Hive first, then, use Spark to load data from Hive.
My question here is, in my situation, what's the pros / benefit if i introduce a Hive layer rather than directly loading the csv file to Spark DF.
Thanks.
You can always look and feel the data using the tables.
Adhoc queries/aggregation can be performed using HiveQL.
When accessing that data through Spark, you need not mention the schema of the data separately.

What is the best way to store incoming streaming data?

What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?
In my experience we have used Hbase as datastore for spark streaming data(we also has same scenario many writes and few reads), since we are using hadoop, hbase has native integration with hadoop and it went well..
Above we have used tostore hight rate of messages coming over from solace.
HBase is well suited for doing Range based scans. Casandra is known for availablity and many other things...
However, I can also observe one general trend in many projects, they are simply storing rawdata in hdfs (parquet + avro) in partitioned structure through spark streaming with spark dataframe(SaveMode.Append) and they are processing rawdata with Spark
Ex of partitioned structure in hdfs :
completion ofbusinessdate/environment/businesssubtype/message type etc....
in this case there is no need for going to Hbase or any other data store.
But one common issue in above approach is when you are getting small and tiny files, through streaming then you would need to repartion(1) or colelese or FileUtils.copymerge to meet block size requirements to single partitioned file. Apart from that above approach also would be fine.
Here is some thing called CAP theorm based on which decision can be taken.
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it
succeeded or failed).
Partition tolerance (the system continues to
operate despite arbitrary partitioning due to network failures)
Casandra supports AP.
Hbase supports CP.
Look at detailed analysis given here

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on the usage of CQL and in-memory query engine Spark/Shark. From what I know, CQL processor is running inside Cassandra JVM on each node. Shark/Spark query processor attached with a Cassandra cluster is running outside in a separated cluster. Also, Datastax has DSE version of Cassandra which allows to deploy Hadoop/Hive. The question is in which use case we would pick a specific solution instead of the other.
I will share a few thoughts based on my experience. But, if possible for you, please let us know about your use-case. It'll help us in answering your queries in a better manner.
1- If you are going to have more writes than reads, Cassandra is obviously a good choice. Having said that, if you are coming from SQL background and planning to use Cassandra then you'll definitely find CQL very helpful. But if you need to perform operations like JOIN and GROUP BY, even though CQL solves primitive GROUP BY use cases through write time and compact time sorts and implements one-to-many relationships, CQL is not the answer.
2- Spark SQL (Formerly Shark) is very fast for the two reasons, in-memory processing and planning data pipelines. In-memory processing makes it ~100x faster than Hive. Like Hive, Spark SQL handles larger than memory data types very well and up to 10x faster thanks to planned pipelines. Situation shifts to Spark SQL benefit when multiple data pipelines like filter and groupBy are present. Go for it when you need ad-hoc real time querying. Not suitable when you need long running jobs over gigantic amounts of data.
3- Hive is basically a warehouse that runs on top of your existing Hadoop cluster and provides you SQL like interface to handle your data. But Hive is not suitable for real-time needs. It is best suited for offline batch processing. Doesn't need any additional infra as it uses underlying HDFS for data storage. Go for it when you have to perform operations like JOIN, GROUP BY etc on large dataset and for OLAP.
Note : Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features but potentially faster. It supports the existing Hive Query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.
But I think you will be able to evaluate the pros and cons of all these tools properly only after getting your hands dirty. I could just suggest based on your questions.
Hope this answers some of your queries.
P.S. : The above answer is based on solely my experience. Comments/corrections are welcome.
There is a very good effort for benchmark documented here - https://amplab.cs.berkeley.edu/benchmark/

Resources