What role does Spark SQL play? Is it an in-memory DB? - apache-spark

Recently I started working with Spark SQL.
I have read the Data Source API docs and I am still confused about what role Spark SQL plays.
When I run SQL over whatever data I need, will Spark load all the data first and then execute the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data that has already been loaded. Or does it scan the source every time?
Any answers are much appreciated.
Best regards.

I have read the Data Source API docs and I am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table you are querying is present either in the Hive metastore (i.e., the table exists in Hive) or as a temporary view registered in the current SQLContext/HiveContext.
So, if you have a DataFrame df and you want to run SQL queries over it, you first register it as a temporary view:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally.
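For example, a minimal Scala sketch of that flow (the JSON path and the age column are just placeholders, not something from the question):

import org.apache.spark.sql.SparkSession

// Obtain (or create) the session that exposes both the DataFrame and SQL APIs.
val spark = SparkSession.builder().appName("temp-view-demo").getOrCreate()

// Load some data into a DataFrame; any supported source works, the path is a placeholder.
val df = spark.read.json("/path/to/people.json")

// Register the DataFrame as a temporary view so SQL can refer to it by name.
df.createOrReplaceTempView("temp_table")

// Run SQL over the view; the result is just another DataFrame.
val result = spark.sql("SELECT * FROM temp_table WHERE age > 21")
result.show()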
When I run SQL over whatever data I need, will Spark load all the data first and then execute the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be kept in memory and operations will be considerably faster than when the data has to be fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.

Spark SQL will only cache the rows that are pulled in by the action, which means it caches only as many partitions as it has to read during that action. This is why your second call over the same data is much faster than your first one.
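As a minimal sketch of that behaviour, using the temp_table view from above (count(*) stands in for any action):

// Mark the view for caching; this is lazy, so nothing is read yet.
spark.catalog.cacheTable("temp_table")

// First action: scans the source and fills the cache for the partitions it touches.
spark.sql("SELECT count(*) FROM temp_table").show()

// Second action over the same data: served from the in-memory cache, much faster.
spark.sql("SELECT count(*) FROM temp_table").show()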

Related

The mechanism behind the join operation between one local table and one DB table

When I register one table from a local RDD and one table from a DB, I find the join operation between the two tables is really slow.
The table from the DB is actually a SQL query with multiple join operations, and the local RDD has only 20 records.
I am curious about the mechanism behind this.
Do we pull the data from the remote DB and execute all the tasks in the local Spark cluster?
Or is there a 'smart' SQL engine that sends an optimized query to the DB and waits for the result to come back? In my opinion this does not make sense, because the query executes really fast in the DB.
Spark SQL will run the queries on its side. When you define some tables to join in a query, in the first step it fetches the tables into the cluster and keeps them in memory as RDDs or DataFrames, and finally it runs tasks to perform the query operations.
In your example, suppose the first RDD is already in memory but the second one needs to be fetched. The data source will ask your SQL engine for the table. But because the table is defined by a query with multiple joins, your SQL engine will run the query on its side (separate from the Spark cluster) and deliver the results once the table is ready. Suppose the SQL engine takes TA seconds to run the query (producing the full result, not just the top rows you see in a SQL client), and moving the data to the Spark cluster (possibly over the network) takes TB seconds. After TA+TB seconds the data is ready for Spark.
If TC is the time for the join operation, the total time will be total = TA + TB + TC. You need to check where your bottleneck is. I think TB may be the critical one for large data.
However, when using a cluster with two or more workers, make sure that all nodes take part in the operation. Sometimes, because of incorrect coding, Spark will use just one node for the work. Make sure your data is spread across the cluster to benefit from data locality.
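For reference, a rough Scala sketch of the setup being described; the JDBC URL, credentials and the multi-join subquery are placeholders, not the asker's actual schema:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("local-vs-db-join").getOrCreate()
import spark.implicits._

// The small local side of the join (the 20-record RDD in the question).
val localDf = Seq((1, "a"), (2, "b")).toDF("id", "value")

// The DB side: the multi-join query runs inside the database (TA) and its
// full result is shipped over the network to the cluster (TB).
val dbDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")                              // placeholder
  .option("dbtable", "(SELECT a.id, a.val FROM ta a JOIN tb b ON a.id = b.id) AS q") // placeholder
  .option("user", "user")
  .option("password", "password")
  .load()

// The join itself (TC) is executed by Spark on the cluster.
val joined = localDf.join(dbDf, Seq("id"))
joined.show()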

Spark SQL managed tables - needs & features

I have the below questions on Spark managed tables.
1. Are Spark tables/views meant for long-term persistent storage? If yes, when we create a table, where does the physical data get stored?
2. What is the exact purpose of Spark tables as opposed to DataFrames?
3. If we create Spark tables for temporary purposes, are we not filling up disk space that would otherwise be used for Spark compute needs in jobs?
1. Spark views (temp views and global temp views) are not for long-term persistent storage; they stay alive only while your Spark session is active. When you write data into one of your Spark/Hive tables using the saveAsTable() method, the data is written into the spark-warehouse folder at the warehouse path configured for your Spark installation.
2. DataFrames are a more programmatic interface, but for developers who are not comfortable with programming languages like Scala, Python, or R, Spark exposes a SQL-based API. Under the hood they all have the same performance, because they all pass through the same Catalyst optimizer.
3. Spark manages all its tables in its catalog, and the catalog is connected to its metastore. When we create a temp view, it does not copy the data into the catalog or the spark-warehouse folder; rather, it keeps a reference to the DataFrame from which we created the temp view. When the Spark session ends, these temporary views are also removed from the catalog.
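A short sketch contrasting the two, assuming the SparkSession spark from the earlier examples (the table and view names are made up):

// A throwaway DataFrame just for illustration.
val df = spark.range(10).toDF("id")

// Session-scoped temporary view: only a catalog entry pointing at df,
// no data is copied; it disappears when the session ends.
df.createOrReplaceTempView("tmp_ids")

// Managed (persistent) table: the data is physically written under
// spark.sql.warehouse.dir (the spark-warehouse folder by default) and survives the session.
df.write.mode("overwrite").saveAsTable("ids_managed")

// Both can be queried the same way through SQL.
spark.sql("SELECT count(*) FROM tmp_ids").show()
spark.sql("SELECT count(*) FROM ids_managed").show()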

Can Apache Spark be used in place of Sqoop

I have tried connecting Spark via JDBC connections to fetch data from MySQL / Teradata or similar RDBMSs and was able to analyse the data.
Can Spark be used to store the data into HDFS?
Is there any possibility of Spark outperforming the activities of Sqoop?
Looking forward to your valuable answers and explanations.
There are two main things to note about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS no matter what you have, and you don't need to worry much about how your table is configured.
With Spark, using a JDBC connection is a little different in terms of how you need to load the data. If your table doesn't have a column like a numeric ID or a timestamp, Spark will load ALL the data into one single partition and then try to process and save it. If you do have a column to use for partitioning, Spark can sometimes be even faster than Sqoop.
I would recommend taking a look at the Spark JDBC data source documentation.
The conclusion is: if you are going to do a simple export that needs to run daily with no transformation, I would recommend Sqoop, as it is simple to use and will not impact your database that much. Spark will work well IF your table is ready for that; otherwise, go with Sqoop.
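For reference, a sketch of the partitioned JDBC read described above; the connection details, table, column and bounds are placeholders you would replace with values from your own database:

// Without the partitioning options Spark pulls the whole table through a single
// connection into one partition; with them the read is split across executors.
val ordersDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")  // placeholder
  .option("dbtable", "orders")                      // placeholder
  .option("user", "user")
  .option("password", "password")
  .option("partitionColumn", "id")   // a numeric or date/timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")      // 8 parallel reads
  .load()

// Land the data on HDFS, which is the part Sqoop would otherwise handle.
ordersDf.write.mode("overwrite").parquet("hdfs:///data/orders")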

Best failsafe strategy to store results of Spark SQL for structured streaming and OLAP queries

I would like to store the results of continuous queries running against streaming data in such a manner that the results are persisted across distributed nodes, to ensure failover and scalability.
Can Spark SQL experts please shed some light on:
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure data is available for querying even if one node is down
- (3) internally, how does Spark SQL store the result set?
Thanks
Kaniska
It depends on what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using the persist() API. If your data is small, then cache()-ing each RDD should give you a good result.
Store the data where your Spark executors are co-located.
It is also possible to use memory-based storage like Tachyon to persist your stream (i.e. each RDD of your stream) and query against it.
If latency is not an issue then persist(StorageLevel.MEMORY_AND_DISK_2) should give you what you need. Mind you, performance is hit-or-miss in that scenario. Also, this stores the data on two executors.
In other cases, if your clients are more comfortable with an OLTP-like database where they just need to query the constantly updating result, you can use a conventional database like Postgres or MySQL. This is a preferred method for many, as query time is consistent and predictable. If the result is not update-heavy but is partitioned (say, by time), then Greenplum-like systems are also an option.
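A rough sketch of the persist-with-replication option mentioned above; the events view is hypothetical, and MEMORY_AND_DISK_2 keeps each partition on two executors:

import org.apache.spark.storage.StorageLevel

// Persist the query result with a replication factor of 2 (materialized on the
// first action), so it survives the loss of a single executor.
val result = spark.sql("SELECT key, count(*) AS cnt FROM events GROUP BY key")
result.persist(StorageLevel.MEMORY_AND_DISK_2)

// Later OLAP-style queries reuse the replicated result.
result.createOrReplaceTempView("event_counts")
spark.sql("SELECT * FROM event_counts ORDER BY cnt DESC LIMIT 10").show()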

Spark SQL performance is very bad

I want to use Spark SQL. I found the performance to be very bad.
In my first solution:
when each SQL query comes in,
load the data from the HBase entity into a dataRDD,
then register this dataRDD with the SQLContext,
and finally execute the Spark SQL query.
Apparently this solution is very bad because it needs to load the data every time.
So I improved the first solution.
My second solution does not consider HBase data updates and inserts:
When the app starts, load the current data from the HBase entity into a dataRDD, named cachedDataRDD.
Register cachedDataRDD with the SQLContext.
When each SQL query comes in, execute the Spark SQL query. The performance is very good.
But some entities need to take updates and inserts into account.
So I changed the solution, based on the second solution.
My third solution needs to consider HBase data updates and inserts:
When the app starts, load the current data from the HBase entity into a dataRDD, named cachedDataRDD.
When a SQL query comes in, load the newly updated and inserted data into another dataRDD, named newDataRDD.
Then set cachedDataRDD = cachedDataRDD.union(newDataRDD);
Register cachedDataRDD with the SQLContext.
Finally, execute the Spark SQL query.
But I found that the union transformation makes the collect action for getting the query result very slow. Much slower than the HBase API query.
Is there any way to tune the performance of the third solution? Under what conditions is Spark SQL the better choice? Are there any good use cases for Spark SQL? Thanks.
Consider creating a new table for newDataRDD and doing the UNION on the Spark SQL side. So, for example, instead of unioning the RDDs, do:
SELECT * FROM data
UNION
SELECT * FROM newData
This should provide more information to the query optimizer and hopefully help make your query faster.
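A sketch of that suggestion, assuming the RDDs have already been converted to DataFrames (cachedDF and newDF are hypothetical names for them):

// Register both datasets as views instead of unioning the RDDs by hand.
cachedDF.createOrReplaceTempView("data")
newDF.createOrReplaceTempView("newData")

// Let the Catalyst optimizer see the whole query, including the UNION.
val result = spark.sql("""
  SELECT * FROM data
  UNION
  SELECT * FROM newData
""")
result.show()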
