Question regarding Spark sql tables and databases - apache-spark

I am new to Spark and am confused regarding the below point:
When we are creating new databases and global tables for our own analysis(using dataframe API or spark sql), where are these getting created/stored? Are these getting stored in Spark memory or in external storage(could be Hive/HDFS/RDBMS..etc) from where Spark is reading data? Does temporary view/local tables only get created in Spark memory?
Thanks!

When we are creating new databases and global tables for our own analysis(using dataframe API or spark sql), where are these getting created/stored?
It depends on your infrastructure. For example:
If you're in an on-prem environment, the underlying data is on HDFS
If you're in Azure Databricks, the underlying data is on Azure Data Storage
If you're in Databricks Cloud, the underlying data is on Amazon S3
Are these getting stored in Spark memory or in external storage(could be Hive/HDFS/RDBMS..etc) from where Spark is reading data?
This data only loaded to memory when you call df.cache()
Does temporary view/local tables only get created in Spark memory?
Yes

Related

How to set up metadata database for Spark SQL?

Hive can have its metadata and stores the tables,columns,partitions information over there.
If I do not want to use the hive.Can we create a metadata for spark same as hive.
I want to query spark SQL (not using dataframe) like Hive (select, from and where) Can we do that? if yes, which relational DB can we use for metadata storage?
Can we create a metadata for spark same as hive.
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
if yes which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
Spark is essentially a distributed computation system instead of a distributed storage. Therefore, we mostly use Spark to do the computation work, which needs the metadata from different storage.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.

Redshift with Spark Streaming

I have a Kafka - Spark Streaming application to ingest and process 60K events per min. I need a database to store my transformed dataframes to be accessed by visualization layer. Can Redshift be used for this with Spark Streaming or should Cassandra be used? I will be processing and storing the dataframes in every spark window of 30 seconds. Also I need to read from the datastore in every window. I guess Redhsift is primarily a data warehousing database not for OLTP sort of the processing.. any ideas?
You should check out SnappyData. SnappyData deeply integrates an in-memory database with Spark that allows hybrid OLTP/OLAP applications. You can write Spark Streaming applications on top of Snappy that can update/delete data from the database. Further, because it does not go over a connector, it performs better than the myriad datastores that have Spark connectors and even the native Spark cache. There may be other datastores that offer hybrid OLTP/OLAP applications on Spark in the aforementioned link.
Disclaimer: I am a SnappyData employee.

Spark Permanent Tables Management

I have a question regarding best practices for managing permanent tables in Spark. I have been working previously with Databricks, and in that context, Databricks manages permanent tables so you do not have to 'create' or reference them each time a cluster is launched.
Let's say in a Spark cluster session, a permanent table is created with saveAsTable command using option to partition the table. Data is stored in S3 as parquet files.
Next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make saved table available again as the same table with same structure/options/path? Maybe there a way to store hive metastore settings to be reused between spark sessions? Or maybe each time a spark cluster is created, I should do CREATE EXTERNAL TABLE with the correct options to tell the format (parquet), the partitioning and the path?
Furthermore, if I want to access those parquet files from another application, i.e. Apache Impala, is there a way to store and retrieve hive metastore information or the table has to be created again?
Thanks

What specific benefits can we get by using SparkSQL to access Hive tables compared to using JDBC to read tables from SQL server?

I just got this question while designing the storage part for a Hadoop-based platform. If we want to have data scientists to have access to the tables which have already been stored in a relational database (e.g.SQL-server of a Azure Virtual Machine), then will there be any particular benefits if we import the tables from SQL-server to HDFS (e.g. WASB) and create Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC,is there any performance improvement if we persist the tables from the database in appropriate format (avro, parquet etc.) in HDFS and use SparkSQL to access them using HQL?
I am sorry if this question has been asked, I have done some research but could not get a comparison between the two methodologies.
I think there will be a big performance improvement as the data is local (assuming Spark is running on same Hadoop cluster where the data is stored on HDFS). Using JDBC if the actions/processing performed is interactive then user has to wait for the data to be loaded through JDBC from another machine (N/W latency and IO throughput) whereas if that is done upfront then user (data scientist) can concentrate on performing the actions straight away.

connecting to spark data frames in tableau

We are trying to generate reports in tableau by spark SQL connectivity, But i found out that we are ultimately connecting to hive meta-store.
If this is the case what are the advantages of this new spark SQL connection. Is there a way to connect to spark data frames that are persisted, from tableau using spark SQL.
The problem here is a Tableau problem more than a Spark problem. Spark SQL Connector launches a Spark job each time you connect to a database. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages, and each time you make a change or select on a graph, the refresh has to go a level deeper to Hive metastore to get the data, through Spark. That is how Tableau is designed. The only option here is to change Tableau for Spotfire (or some other tool) where by pre-caching the underlying Hive table, the Spark SQL Connector can query it directly from Spark distributed memory, skipping the load step.
Disclosure: I am in no way associated with Spotfire makers

Resources