List of Spark SQL supported data stores - apache-spark

I am a complete beginner here, so apologies if this is a duplicate or silly question.
To come to the point: my product (a Java web application) needs a component that pushes data to any one of several data stores, chosen by configuration. The data store can be an RDBMS, Hive, or any NoSQL data store. So my question is: is Spark SQL the best fit for my case? If yes, can I have a list of the data stores supported by Spark SQL? If Spark won't do this, are there any other approaches?
Please help me!

Yes! Spark SQL (Spark) is a good fit for your use case.
As far as I know, Spark SQL supports RDBMSs (through JDBC), Hive, and most NoSQL data stores (through connectors).
Spark SQL may not have APIs to access a few stores directly, but with a little help from Spark's core API you should be able to connect to almost any data store.
We have been using Spark to connect to RDBMSs, Cassandra, HBase, Elasticsearch, Solr, Hive, S3, HDFS, etc.
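To make the "based on some configuration" part concrete, here is a minimal sketch in PySpark. It assumes the relevant JARs (a JDBC driver, the Spark Cassandra Connector) are on the classpath. The layout of the `target` dictionary and the names `writer_settings` and `write_to_store` are hypothetical, invented for illustration; the format names and option keys are the ones used by Spark's JDBC source and by the Cassandra connector.

```python
from typing import Any, Dict, Tuple


def writer_settings(target: Dict[str, Any]) -> Tuple[str, Dict[str, str]]:
    """Map a (hypothetical) target config to a Spark format name and its options."""
    kind = target["kind"]
    if kind == "jdbc":
        # Built-in JDBC data source: works for any RDBMS that ships a JDBC driver.
        return "jdbc", {
            "url": target["url"],
            "dbtable": target["table"],
            "user": target["user"],
            "password": target["password"],
        }
    if kind == "cassandra":
        # Provided by the separate Spark Cassandra Connector package.
        return "org.apache.spark.sql.cassandra", {
            "keyspace": target["keyspace"],
            "table": target["table"],
        }
    raise ValueError(f"unsupported data store kind: {kind!r}")


def write_to_store(df, target: Dict[str, Any], mode: str = "append") -> None:
    """Write a Spark DataFrame `df` to the store described by `target`."""
    if target["kind"] == "hive":
        # Hive tables go through the metastore; this assumes the SparkSession
        # was built with enableHiveSupport().
        df.write.mode(mode).saveAsTable(target["table"])
        return
    fmt, options = writer_settings(target)
    df.write.format(fmt).options(**options).mode(mode).save()
```

With something like this, the web application only has to hand over a config entry such as `{"kind": "cassandra", "keyspace": "ks", "table": "events"}`; supporting another store means adding one more branch and the matching connector.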

Understanding Kappa architecture with Apache Superset

There is a lot of information about Kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.
Let's imagine you want to implement a Kappa architecture involving the following tech stack:
Apache Kafka
Apache Spark
Apache Superset
Now imagine the application you want to run data analytics against has a PostgreSQL database. Of course, you can easily connect Apache Superset directly to the PostgreSQL database and create charts.
But now you want to see how you would do this with a Kappa architecture, so you add Kafka and Spark.
You can emit events to Kafka and read those events in Apache Spark. Kafka retains the messages of a topic for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark, the docs say Hive should be used as the connector (the project website also states the tool is unsupported, and if you look at this issue on PyHive you will find that impyla could be an alternative). But Apache Hive is a completely different project, for a storage system. So how would this connection work?
Assume you have Kafka nodes running (with ZooKeeper, obviously) and Spark running as well, and then you connect Apache Superset to Spark through this Hive connector.
How can you write queries against the data that is in Kafka (which is in fact the live data)?
On the Spark side you can easily write a Scala program that reads data from Kafka and does something with it, but how can you achieve this from Apache Superset?
Or is this not the intended way of connecting these things?
If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table in the Hive metastore, which could then be queried from Superset through the Spark Thrift server.
Hive itself doesn't store any of the data. Hive also has a built-in Kafka storage handler, so Spark isn't strictly necessary.
But Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset.
Or you can remove Spark and use Kafka Connect to write to any other system that a dashboarding tool (Tableau is another popular one) supports: a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.
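For the Structured Streaming route, a rough sketch in PySpark could look like the following. It assumes Spark 3.1 or later (for `toTable`), the `spark-sql-kafka` package on the classpath, and a SparkSession that shares its metastore with the Thrift server Superset connects to. The function names, table name and checkpoint directory are hypothetical.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Options for Spark's Kafka source (needs the spark-sql-kafka package)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Start from the oldest messages Kafka still retains for the topic.
        "startingOffsets": "earliest",
    }


def start_kafka_to_table(spark, bootstrap_servers: str, topic: str,
                         table: str, checkpoint_dir: str):
    """Continuously append Kafka records to a metastore table.

    Returns the running streaming query.
    """
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka delivers keys and values as bytes; cast them so SQL can use them.
        .selectExpr("CAST(key AS STRING) AS key",
                    "CAST(value AS STRING) AS value",
                    "timestamp")
    )
    return (
        events.writeStream
        .outputMode("append")
        .option("checkpointLocation", checkpoint_dir)
        .toTable(table)  # available from Spark 3.1
    )
```

Superset would then run plain SQL against the table, and every dashboard refresh would see whatever the stream has appended so far.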

How different and efficient is Alibaba Table Store compared with Apache Cassandra?

How different and efficient is Alibaba Table Store compared with Apache Cassandra? I understand both are NoSQL databases. Can anyone please elaborate on where and when Alibaba Table Store is preferred over Apache Cassandra?
You can think of Alibaba Cloud Table Store as a counterpart to Apache Cassandra, because Table Store meets all the requirements that Cassandra does.
As for the benefits of Table Store compared to Cassandra, you need not worry about the following things when you use Table Store:
Scalability
Multi-datacenter replication
Distributed
MapReduce support
Fault-tolerant
Alibaba Cloud may not be using Cassandra at the backend; there is no mention of that.
In all the scenarios where Cassandra is used, you can replace it with Table Store. But again, I have not worked extensively with applications involving Apache Cassandra.
If you read the sample code for filtering, you will realize the differences from Cassandra. You will need to use different data modelling in Table Store.

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema was not designed for the queries you want to run.
If you want to use your data without restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
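As an illustration of the kind of query this enables, here is a small PySpark sketch that joins two Cassandra tables and aggregates the result; CQL has no joins. It assumes the Spark Cassandra Connector is on the classpath. The keyspace `iot`, the tables `readings` and `devices`, and the columns `device_id`, `site` and `temperature` are made up for the example.

```python
def load_cassandra_table(spark, keyspace: str, table: str):
    """Load a Cassandra table as a Spark DataFrame via the Spark Cassandra Connector."""
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )


def average_temperature_per_site(spark, keyspace: str = "iot"):
    """Join sensor readings with device metadata, then average per site."""
    readings = load_cassandra_table(spark, keyspace, "readings")
    devices = load_cassandra_table(spark, keyspace, "devices")
    return (
        readings.join(devices, on="device_id")  # a join, which CQL cannot do
        .groupBy("site")
        .avg("temperature")
    )
```

Calling `average_temperature_per_site(spark).show()` would run the whole thing on the Spark cluster, so the Cassandra schema does not have to be redesigned around this one query.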
Cassandra is for storing data, whereas Spark is for performing computation on top of it. By analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the transformations (functions that modify the data) run locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, as described below.
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines long, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark + Zeppelin we can visualize Cassandra data: we can build beautiful UIs with little Spark code, where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I have used Spark as a Swiss Army knife for multiple tasks.
Apache Cassandra offers fast reads and writes, so you can use it with Spark Streaming to write your data directly into Cassandra, with no legacy system in between.
As a use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob.

What specific benefits can we get by using SparkSQL to access Hive tables compared to using JDBC to read tables from SQL server?

I ran into this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables that are already stored in a relational database (e.g. SQL Server on an Azure virtual machine), are there any particular benefits to importing the tables from SQL Server into HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark lets users read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked before; I did some research but could not find a comparison between the two approaches.
I think there will be a big performance improvement because the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored in HDFS). With JDBC, if the processing is interactive, the user has to wait for the data to be loaded over JDBC from another machine (network latency and I/O throughput), whereas if the import is done up front, the user (the data scientist) can start working on the data straight away.
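The up-front import could be sketched in PySpark roughly as follows. This assumes the Microsoft SQL Server JDBC driver is on the classpath and that the table has a numeric column to partition on. The function names and parameters are hypothetical; the option keys are those of Spark's JDBC data source.

```python
from typing import Dict


def sqlserver_read_options(host: str, database: str, table: str,
                           user: str, password: str,
                           partition_column: str,
                           lower_bound: int, upper_bound: int,
                           num_partitions: int = 8) -> Dict[str, str]:
    """Build options for a parallel JDBC read from SQL Server."""
    return {
        "url": f"jdbc:sqlserver://{host}:1433;databaseName={database}",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
        # These four options make Spark issue parallel range queries over
        # partition_column instead of pulling the table through one connection.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def import_table_to_parquet(spark, read_options: Dict[str, str],
                            target_path: str) -> None:
    """One-off import: read the table over JDBC, persist it as Parquet."""
    df = spark.read.format("jdbc").options(**read_options).load()
    df.write.mode("overwrite").parquet(target_path)
```

After the import, interactive queries read columnar Parquet files from the cluster's own storage and no longer touch the SQL Server machine; the trade-off is that the copy has to be refreshed whenever the source tables change.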

HBase or Cassandra?

In my lambda architecture, I am debating whether to use HDFS or Cassandra to store my immutable data. I need Cassandra to serve the online requests etc., so it is a mandatory part of the tech stack. Now, I do not want to introduce a new tool (HDFS) into the stack if I don't have to. So my question is: what will I be missing if I don't use HDFS and use Cassandra to host my immutable data as well?
EDIT:
I understand HDFS is a distributed filesystem and Cassandra is a NoSQL DB. Still, both support data replication and both support high-throughput writes. In addition, Cassandra supports low-latency data retrieval. So am I right in saying that HDFS isn't going to provide me much lift?
As I understand it, you are trying to clarify the serving layer of your lambda architecture.
If that is true, you want to store your batch views and real-time views in a database.
And as I understand it, you do not have a Hadoop cluster in your batch layer.
And your batch views are not being produced in HDFS.
At this point your architecture is outside of HDFS.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
If you don't want a Hadoop cluster, omit HBase.
Cassandra is a distributed NoSQL database (column-oriented) and it works outside the Hadoop cluster and HDFS.
If I understand your architecture and your needs right, I think Cassandra is best for you.
Additionally, you can get a quick overview of lambda architecture from this link:
http://artofbigdata.blogspot.com.tr/2016/01/lambda-architecture.html
HDFS supports different file formats for storage, for example sequence files, Avro, and Parquet, so you can choose a file format suitable for your application's needs.
Also note that you can efficiently read the data using SQL-like queries.
So HDFS gives you a choice of data models for hosting the data that Cassandra does not.
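As a rough illustration of these two points, here is a PySpark sketch that keeps the immutable data as append-only Parquet files and queries it with SQL. The path, the view name `events` and the partition column `event_date` are assumptions made for the example.

```python
def append_events(df, path: str) -> None:
    """Append a batch of immutable events as Parquet, partitioned by date."""
    df.write.mode("append").partitionBy("event_date").parquet(path)


def query_events(spark, path: str, sql: str):
    """Expose the Parquet files as a temporary view named `events` and run SQL on it."""
    spark.read.parquet(path).createOrReplaceTempView("events")
    return spark.sql(sql)
```

For example, `query_events(spark, "hdfs:///data/events", "SELECT event_date, COUNT(*) FROM events GROUP BY event_date")` would scan only the columns and partitions it needs, which is where a columnar format such as Parquet helps for batch views.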
