The mechanism behind the join operation between one local table and one DB table - apache-spark

When I register one table from a local RDD and one table from a DB, I find that the join between the two tables is really slow.
The table from the DB is actually a SQL query that has multiple join operations, and the local RDD only has 20 records.
I am curious about the mechanism behind it.
Do we pull the data from the remote DB and execute all tasks in the local Spark cluster?
Or do we have an 'interesting' SQL engine that sends an optimized query to the DB and waits for the query result to come back? In my opinion, this explanation does not make sense, because the query executes really fast in the DB.

Spark SQL runs the query on its own side. When you define tables to join in a query, it first fetches the tables into the cluster and keeps them in memory as RDDs or DataFrames, and then runs tasks to perform the query operations.
In your example, suppose the first RDD is already in memory but the second needs to be fetched. The data source asks your SQL engine for the table. Because that table is really a query with multiple joins, your SQL engine runs the query on its side (separately from the Spark cluster) and delivers the results once they are ready. Suppose the SQL engine takes TA seconds to run the query (producing the full result, not just the top rows that a SQL client shows), and moving the data to the Spark cluster (possibly over the network) takes TB seconds. After TA+TB seconds the data is ready for Spark to work on.
If TC is the time for the join operation, the total time is total = TA+TB+TC. You need to check where your bottleneck is. I think TB may be the critical one for large data.
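As a rough illustration of the total = TA+TB+TC breakdown, the sketch below takes three timings and reports which stage dominates. The function name and the numbers are hypothetical placeholders, not measurements from the question; you would substitute timings taken from your own database and Spark job.

```python
def find_bottleneck(ta, tb, tc):
    """Return (name of the slowest stage, total time TA + TB + TC)."""
    stages = {
        "TA (query inside the database)": ta,
        "TB (transfer to the Spark cluster)": tb,
        "TC (join inside Spark)": tc,
    }
    # The stage with the largest share of the total is the one worth tuning first.
    slowest = max(stages, key=stages.get)
    return slowest, ta + tb + tc


# Hypothetical timings in seconds, for illustration only.
slowest, total = find_bottleneck(ta=12.0, tb=95.0, tc=4.0)
print(f"total = {total:.1f}s, bottleneck = {slowest}")
```

With timings like these, the transfer step would be the place to look first, which matches the guess above that TB is often the critical term for large data.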
However, when using a cluster with two or more workers, make sure that all nodes are involved in the operation. Sometimes, because of a coding mistake, Spark uses just one node to do the work. Make sure your data is spread out over the cluster to benefit from data locality.


2 million queries against a dataframe

I need to run 2 million queries against a three-column table t (s, p, o) that has 10 billion rows. The data type of each column is string.
There are only two types of queries:
select s, p, o from t where s = param
select s, p, o from t where o = param
If I store the table in a PostgreSQL database, it takes 6 hours using a Java ThreadPoolExecutor.
Do you think Spark can speed up the query processing even more?
What would be the best strategy? These are my ideas:
Load the table into a dataframe and launch the queries against the dataframe.
Load the table into a parquet database and launch the queries against this database.
Use Spark 2.4 to launch queries against the PostgreSQL database instead of querying it directly.
Use Spark 3.0 to launch queries against the database loaded into PG-Strom, an extension module of PostgreSQL with GPU support.
Using Apache Spark on top of existing MySQL or PostgreSQL server(s) (without the need to export or even stream data to Spark or Hadoop) can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL query results table.
The idea is simple: Spark can read MySQL or PostgreSQL data via JDBC and can also execute SQL queries, so we can connect it directly to the databases and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries, it can be much faster because Spark is a massively parallel system. For example, MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes.
But I recommend you use NoSQL (HBase, Cassandra, ...) or NewSQL solutions for your analyses, because they perform better as the scale of your data increases.
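To make the JDBC route described above more concrete, here is a minimal sketch of a partitioned JDBC read in PySpark. It rests on several assumptions: the URL, credentials, and helper names are hypothetical, and the table is assumed to have a numeric column (called id here) to split on, which the (s, p, o) table in the question does not have as described. The matching JDBC driver jar would also need to be on the Spark classpath.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column, lower_bound, upper_bound, num_partitions):
    """Build the option map for a partitioned Spark JDBC read (hypothetical helper)."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark issues one range query per partition on this column,
        # so several connections read from the database in parallel.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def load_table(spark, options):
    """Load the table through the JDBC data source; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
options = jdbc_read_options(
    "jdbc:postgresql://localhost:5432/mydb", "t", "spark_user", "secret",
    partition_column="id", lower_bound=1, upper_bound=10_000_000_000,
    num_partitions=64,
)
```

Calling load_table(spark, options) would return a DataFrame that can be registered as a temporary view and queried with spark.sql. Without the partitioning options, the whole table is read through a single connection.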
Static Data? Spark; Otherwise tune Postgres
If the 10 billion rows are static or rarely updated, your best bet is going to be using Spark with appropriate partitions. The magic happens with parallelization, so the more cores you have, the better. You want to aim for partitions that are about half a gig in size each.
Determine the size of the data by running SELECT pg_size_pretty(pg_total_relation_size('tablename')); then divide the result by a multiple of the number of cores available to Spark, increasing the multiple until each partition is between 1/8 and 3/4 gig.
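The sizing rule above can be sketched as a small calculation. This is one possible reading of the advice, not the answerer's own code: the helper name is hypothetical, and it picks the smallest multiple of the core count that brings each partition down to at most 3/4 gig, so every core gets an equal number of partitions.

```python
GIB = 1024 ** 3


def choose_partitions(table_bytes, cores, max_partition_bytes=3 * GIB // 4):
    """Return (partition count, approximate bytes per partition).

    The count is the smallest multiple of `cores` for which a partition
    is no larger than `max_partition_bytes`.
    """
    multiple = 1
    # Integer comparison: grow the multiple while partitions are still too large.
    while table_bytes > max_partition_bytes * cores * multiple:
        multiple += 1
    partitions = cores * multiple
    return partitions, table_bytes / partitions


# Hypothetical example: a 1 TiB table and 64 cores.
partitions, size = choose_partitions(1024 ** 4, 64)
print(partitions, round(size / GIB, 3))
```

For tables that are already small relative to the core count, the function returns one partition per core, and the partitions may then fall below the 1/8 gig lower bound mentioned above.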
Save as parquet if you really have static data or if you want to recover from a failure quickly.
If the source data are updated frequently, you're going to want to add indices in Postgres. It could be as straightforward as adding an index on each column. Partitioning in Postgres would also help.
Stick to Postgres. Newer databases are not appropriate for structured data such as yours. There are parallelization options. Aurora, if you're on AWS.
PG-Strom is not going to work for you here. You have simple data with few columns. Getting them into and out of a GPU is going to slow you down too much.

Converting hive query to Spark-SQL

I have a Hive query that runs in MapReduce mode once per day. The total processing time is 20 minutes, which is fine for our daily process.
I am looking to execute it on the Spark framework.
To begin with, I set the execution engine to spark in the Hive shell and executed the same query.
The process had transformations and actions, and the whole query completed in around 8 minutes. This query has multiple subqueries, IN clauses, and WHERE conditions.
The question is, how does the Spark environment create RDDs for complex queries (assume that I have just run the same query as in Hive)? Does it create an RDD for each subquery?
Now I want to use Spark SQL in place of the Hive query. How should we approach this kind of complex query, where a lot of subqueries and aggregations are involved?
I understand that for relational data computations, we need to use DataFrames.
Would rewriting the queries in Spark SQL be the right approach, or should we stick with setting the execution engine to spark and running the Hive query?
If there are advantages to writing the queries in Spark SQL and running them on Spark, what are they?
For all the subqueries and the various filter and aggregation logic, what would the performance of the DataFrame APIs be?

What role Spark SQL acts? Memory DB?

I recently came to Spark SQL.
I read about the Data Source API and am still confused about what role Spark SQL plays.
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data already loaded. Or does it scan locally every time?
I would really appreciate any answers.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over the data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table you are querying is present either in the Hive Metastore (i.e., the table exists in Hive) or as a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a DataFrame df and you want to run SQL queries over it, you can use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I do SQL on whatever I need, will Spark load all the data first and perform SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be stored in memory and the operations will be considerably faster than when the data is fetched from disk. That part is configurable and up to the user. You can basically tell Spark how you want it to store the data.
Spark SQL caches only the rows that are pulled by the action, which means it caches only as many partitions as it has to read during that action. This makes your first call much faster than your second call.

best failsafe strategy to store result of spark sql for structured streaming and OLAP queries

I would like to store the results of continuous queries running against streaming data in such a way that the results are persisted across distributed nodes, to ensure failover and scalability.
Can Spark SQL experts please shed some light on
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure the data is available for queries even if one node is down
- (3) how Spark SQL stores the result set internally?
It depends on what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using the persist() API. If your data is small, then calling cache() on each RDD should give you a good result.
Store where your Spark executors are co-located. For example:
It is also possible to use memory-based storage like Tachyon to persist your stream (i.e., each RDD of your stream) and query against it.
If latency is not an issue, then persist(MEMORY_AND_DISK_2) should give you what you need. Mind you, performance is hit or miss in that scenario. Also, this stores the data on two executors.
In other cases, if your clients are more comfortable with an OLTP-like database where they just need to query the constantly updating result, you can use a conventional database like Postgres or MySQL. This is the preferred method for many, as query time is consistent and predictable. If the result is not update-heavy but is partitioned (say, by time), then Greenplum-like systems are also a choice.

Spark SQL performance is very bad

I want to use Spark SQL, but I found that the performance is very bad.
In my first solution:
when each SQL query comes in,
load the data from the HBase entity into a dataRDD,
then register this dataRDD with the SQLContext,
and finally execute the Spark SQL query.
Apparently this solution is very bad, because it needs to load the data every time.
So I improved the first solution.
My second solution does not consider HBase data updates and inserts:
When the app starts, load the current data from the HBase entity into a dataRDD, named cachedDataRDD.
Register cachedDataRDD with the SQLContext.
When each SQL query comes in, execute the Spark SQL query. The performance is very good.
But some entities need to take updates and inserts into account.
So I changed the solution, based on the second solution.
My third solution needs to consider HBase data updates and inserts:
When the app starts, load the current data from the HBase entity into a dataRDD, named cachedDataRDD.
When a SQL query comes in, load the newly updated and inserted data into another dataRDD, named newDataRDD.
Then set cachedDataRDD = cachedDataRDD.union(newDataRDD);
Register cachedDataRDD with the SQLContext.
Finally, execute the Spark SQL query.
But I found that the union transformation makes the collect action that fetches the query result very slow, much slower than querying through the HBase API.
Is there any way to tune the performance of the third solution? Under what conditions is Spark SQL usually the better choice? Are there any good use cases for Spark SQL? Thanks
Consider creating a new table for newDataRDD and doing the UNION on the Spark SQL side, instead of unioning the RDDs.
This should provide more information to the query optimizer and hopefully help make your query faster.
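The answer's original snippet is not preserved here, so the following is only a sketch of what a SQL-side union might look like in PySpark. The view names (cached_data, new_data) and the helper functions are hypothetical, and it assumes both datasets are available as DataFrames with the same schema. UNION ALL is used because it keeps duplicates, which matches the behaviour of RDD.union.

```python
def union_query(cached_view, new_view):
    """Build a query that unions two registered views on the Spark SQL side."""
    return f"SELECT * FROM {cached_view} UNION ALL SELECT * FROM {new_view}"


def query_with_updates(spark, cached_df, new_df):
    """Register both DataFrames as temporary views and query their union.

    `spark` is an existing SparkSession; `cached_df` holds the data loaded at
    startup and `new_df` holds the recent updates and inserts.
    """
    cached_df.createOrReplaceTempView("cached_data")
    new_df.createOrReplaceTempView("new_data")
    return spark.sql(union_query("cached_data", "new_data"))


print(union_query("cached_data", "new_data"))
```

In practice you would wrap the union in the query you actually need, for example by adding a WHERE clause around it, so that Spark SQL can plan the filter and the union together.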
