Execute query on Spark vs Redshift - apache-spark

Our data warehouse is in Redshift (50 TB). Sometimes business users run big queries (too many joins, inline queries generated by BI tools such as Tableau), and these big queries slow down database performance.
Is it wise to use Spark on top of Redshift to offload some of the computation outside Redshift?
Or would it be easier and more cost-effective to increase Redshift's compute power by adding more nodes?
If I execute select a.col1, b.col2 from table1 a, table2 b where a.key = b.key in Spark, and the tables are connected via JDBC and reside on Redshift, where does the actual processing happen (in Spark or Redshift)?

Any queries on the data stored in Amazon Redshift are performed by the Amazon Redshift nodes. While Spark could make an external JDBC call, the SQL will be executed by Redshift.
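One way to make sure Redshift does all the work is to push the whole statement down as a JDBC subquery. A minimal PySpark sketch, with a hypothetical cluster endpoint, placeholder credentials, and the Redshift JDBC driver assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-pushdown").getOrCreate()

# Hypothetical connection details; replace with your cluster endpoint and credentials.
jdbc_url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"

# Wrapping the join in a subquery makes Redshift execute the whole statement;
# Spark only receives the result set over JDBC.
pushed_down = """
    (select a.col1, b.col2
     from table1 a, table2 b
     where a.key = b.key) as joined
"""

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", pushed_down)
      .option("user", "<user>")        # placeholder credentials
      .option("password", "<password>")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")  # class name depends on your driver version
      .load())

df.show()

Note that if the two tables are instead loaded as separate JDBC DataFrames and joined with DataFrame operations, Spark typically pushes down only filters and column pruning, and performs the join itself on its executors.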
There are many techniques to optimize Redshift query execution:
Tuning Query Performance
Top 10 Performance Tuning Techniques for Amazon Redshift
Tuning Workload Management parameters to control parallel queries and memory allocation
Start by looking at queries that consume too many resources and determine whether they can be optimized by changing the Sort Key, Distribution Key and Compression Encodings used by each table. Correct use of these parameters can greatly improve Redshift performance.
Then, if many users are running simultaneous queries, check whether it is worth improving Workload Management settings to create separate queues with different memory settings.
Finally, if performance is still a problem, add additional Redshift nodes. Dense compute nodes offer better performance because they use SSD storage, but at a higher cost per TB of storage.
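To illustrate the table-design advice above, here is a minimal sketch (assuming psycopg2 for connectivity and a hypothetical events table) that creates a table with an explicit distribution key, sort key, and compression encodings:

import psycopg2

# Hypothetical connection details; replace with your cluster endpoint and credentials.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="<user>", password="<password>")

ddl = """
create table events_tuned (
    event_id   bigint      encode az64,
    user_id    bigint      encode az64,
    event_type varchar(32) encode zstd,
    event_ts   timestamp   encode az64
)
diststyle key
distkey (user_id)    -- co-locate rows that are joined on user_id
sortkey (event_ts);  -- range-restricted scans on time predicates
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)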

Related

HBase to Delta Tables

We are in the process of migrating a Hadoop workload to Azure Databricks. In the existing Hadoop ecosystem, we have some HBase tables which contain some data (not big). Since Azure Databricks does not support HBase, we are planning to replace the HBase tables with Delta tables.
Is this technically feasible? If yes, are there any challenges or issues we might face during the migration or in the target system?
It all comes down to the access patterns. HBase is an OLTP system where you usually operate on individual records (read/insert/update/delete) and expect subsecond (or millisecond) response times. Delta Lake, on the other hand, is an OLAP system designed for efficient processing of many records together; it can be slower when you read individual records, and especially when you update or delete them.
If your application needs subsecond queries, especially with updates, then it makes sense to set up a test to check whether Delta Lake is the right choice; you may need to look into Databricks SQL, which does a lot of optimizations for fast data access.
If it doesn't fulfill your requirements, then you may look into other products in the Azure ecosystem, such as Azure Redis or Azure CosmosDB, which are designed for OLTP-style data processing.
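To get a feel for the OLTP-style operations in question, here is a minimal PySpark sketch (on Databricks or with the delta-spark package; the path, table layout, and key column are hypothetical) of a point read and an in-place update on a Delta table:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

path = "/mnt/delta/customers"   # hypothetical Delta table location

# Point read: selective, but still a file scan unless data skipping / Z-ordering helps.
row = (spark.read.format("delta").load(path)
       .where("customer_id = 42")
       .collect())

# In-place update: Delta rewrites the affected files rather than a single record,
# which is where the latency difference versus HBase shows up.
tbl = DeltaTable.forPath(spark, path)
tbl.update(
    condition="customer_id = 42",
    set={"status": "'inactive'"})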

Processing of queries using SparkSQL on different databases

I want to use Spark SQL (installed on Machine 1) with connectors for different data stores like HBase, Hive, Cassandra, and MySQL (installed on Machine 2) to perform simple analytics like min/max, averaging, etc.
My question: is the processing of these queries done on Machine 1, or does Spark SQL act as just an interface, with the analytics performed on the data store end (i.e. Machine 2)?
Yes and no. It depends on your Spark job.
Spark SQL is a separate implementation. It is datastore-agnostic. When you implement a Spark SQL job, Spark transforms it into something called a DAG (directed acyclic graph).
It is a similar technique to a database query plan, but it runs completely on the Spark cluster.
In the case of a simple min/max, it might be translated into a direct query against the underlying store. But it might also be translated into something that preselects a bunch of records and then does its own processing in Spark. This is also how it is possible to join and aggregate data from different data sources.
You can analyze the Spark SQL plan with the usual explain statement or via the Spark web UI.
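For example, a minimal sketch (assuming a hypothetical MySQL table on "Machine 2" reachable over JDBC) that uses explain() to see which parts of the query are pushed down to the data store and which run in Spark:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical MySQL connection on "Machine 2".
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://machine2:3306/shop")
          .option("dbtable", "orders")
          .option("user", "<user>")
          .option("password", "<password>")
          .load())

agg = (orders
       .where(F.col("status") == "PAID")       # filter: eligible for pushdown
       .agg(F.min("amount"), F.max("amount"))) # aggregate: computed in Spark

# The physical plan lists "PushedFilters" for what the JDBC source executes
# remotely; the aggregation itself typically appears as a Spark HashAggregate.
agg.explain(True)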

Using Spark Connector for Databricks and Snowflake on AWS

I'm looking at using both Databricks and Snowflake, connected by the Spark Connector, all running on AWS. I'm struggling to understand the following before making a decision:
How well does the Spark Connector perform? (performance, extra costs, compatibility)
What comparisons can be made between Databricks SQL and Snowflake SQL in terms of performance and standards?
What have been the “gotchas” or unfortunate surprises about trying to use both?
Snowflake has invested in the Spark connector's performance and, according to benchmarks [0], it performs well.
The SQL dialects are similar. "Databricks SQL maintains compatibility with Apache Spark SQL semantics." [1] "Snowflake supports most of the commands and statements defined in SQL:1999." [2]
I haven't experienced gotchas. I would avoid using different regions. The performance characteristics of Databricks SQL have been different since 6/17, when they made their Photon engine the default.
As always, the utility will depend on your use case, for example:
If you were doing analytical Databricks SQL queries on a partitioned, compressed Parquet Delta Lake, then the performance ought to be roughly similar to Snowflake; but if you were doing analytical Databricks SQL queries against a JDBC MySQL connection, then the performance of Snowflake should be vastly better.
If you were doing wide table-scan-style queries (e.g. select * from foo with no where and no limit) in Databricks SQL and then doing the analysis in a kernel (or something similar), then switching to Snowflake isn't going to do much for you.
etc
[0] - https://www.snowflake.com/blog/snowflake-connector-for-spark-version-2-6-turbocharges-reads-with-apache-arrow/
[1] - https://docs.databricks.com/sql/release-notes/index.html
[2] - https://docs.snowflake.com/en/sql-reference/intro-summary-sql.html
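For reference, a minimal sketch of reading Snowflake data from Databricks through the Spark connector (the account URL, credentials, warehouse, and query are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On Databricks the Snowflake connector is available under the short format name "snowflake".
sf_options = {
    "sfURL": "xy12345.us-east-1.snowflakecomputing.com",  # hypothetical account
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

df = (spark.read.format("snowflake")
      .options(**sf_options)
      .option("query", "select region, sum(amount) as revenue "
                       "from sales group by region")  # executed by Snowflake
      .load())

df.show()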

running interactive sql queries over millions of parquet files

I have millions of streaming Parquet files being written. I want to support running ad hoc interactive queries for debugging and analytics purposes (added bonus if I can run streaming queries for some real-time monitoring of key metrics as well).
What is a scalable solution for supporting this?
The two approaches I have seen are running Spark SQL interactively over the millions of Parquet files (I'm not too familiar with the Spark ecosystem: does this mean running a Spark job for every SQL statement a user submits, or do I need to run some streaming job and submit queries to it somehow?), and using a Presto SQL engine on top of the Parquet files (I'm not exactly sure how Presto ingests new incoming Parquet files).
Any recommendations or pros and cons of either approach? Are there better solutions, considering I have more than ~10 TB of data produced every day?
Let me address your use cases:
Supporting ad hoc interactive queries for debugging and analytics purposes
I would recommend building a Presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto runs fully in memory without touching disk (in most cases).
A Spark cluster can also do the job; however, it won't be as fast as Presto. The advantage of Spark over Presto is its fault-tolerance capabilities and its ability to spill to disk in case of out-of-memory conditions, which may be important for you given how much data you have.
Run streaming queries for some real-time monitoring of key metrics as well
As long as the queries are fairly basic, you can build dashboards on top of Presto that run them every x minutes.
If the metrics need a considerable amount of processing, that may be a good reason to look at Spark Streaming, provided real-time monitoring is important.
If it isn't, you could build an ETL job (using Spark) that calculates your metrics, stores the result as a new Hive table, and then exposes it for querying via Presto/Spark SQL again.
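As a sketch of the streaming-monitoring option (the events directory, schema, and output paths are hypothetical), Spark Structured Streaming can read newly arriving Parquet files and maintain windowed metrics:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("metrics-monitor").getOrCreate()

# Hypothetical schema of the incoming Parquet events.
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
])

events = (spark.readStream
          .schema(schema)                   # streaming file sources need an explicit schema
          .parquet("s3://bucket/events/"))  # hypothetical landing directory

metrics = (events
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "metric")
           .agg(F.avg("value").alias("avg_value"), F.count("*").alias("n")))

query = (metrics.writeStream
         .outputMode("append")              # append is allowed because of the watermark
         .format("parquet")
         .option("path", "s3://bucket/metrics/")
         .option("checkpointLocation", "s3://bucket/checkpoints/metrics/")
         .start())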
How does Presto ingest new incoming Parquet files?
I'm not aware of your architecture, but in any case you need to provide Presto with a Hive connection (a Hive Metastore, to be precise).
Hive provides Presto with the schemas attached to the directories where you ingest your data. Presto sees the new data dynamically by default. Spark is no different, by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.
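To make that concrete, here is a minimal sketch (table name, schema, path, and partitioning are hypothetical) of registering the Parquet directory as an external table in the Hive Metastore so that both Presto and Spark SQL can query it; new files simply show up under existing partitions, and new partitions can be picked up with a metastore repair:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-events")
         .enableHiveSupport()     # talk to the shared Hive Metastore
         .getOrCreate())

# External table over the directory where the Parquet files land.
spark.sql("""
    create external table if not exists events (
        event_time timestamp,
        metric     string,
        value      double
    )
    partitioned by (dt string)
    stored as parquet
    location 's3://bucket/events/'
""")

# Pick up partitions (dt=...) that were written after the table was created.
spark.sql("msck repair table events")

# The same table is now queryable from Presto via its Hive connector.
spark.sql("select metric, avg(value) from events where dt = '2023-01-01' group by metric").show()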

What specific benefits can we get by using SparkSQL to access Hive tables compared to using JDBC to read tables from SQL server?

I just ran into this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables that are already stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), will there be any particular benefit in importing the tables from SQL Server to HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked before; I have done some research but could not find a comparison between the two approaches.
I think there will be a big performance improvement, as the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored on HDFS). With JDBC, if the processing is interactive, the user has to wait for the data to be loaded over JDBC from another machine (network latency and I/O throughput), whereas if that load is done up front, the user (data scientist) can concentrate on the analysis straight away.
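A minimal sketch of the "import once, query locally" pattern described above (the server, table names, and credentials are hypothetical, and the Microsoft JDBC driver is assumed to be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One-off import: pull the table from SQL Server over JDBC...
sales = (spark.read.format("jdbc")
         .option("url", "jdbc:sqlserver://sqlvm.example.com:1433;databaseName=dw")
         .option("dbtable", "dbo.sales")
         .option("user", "<user>")
         .option("password", "<password>")
         .load())

# ...and persist it as a Parquet-backed Hive table on the cluster's storage (HDFS/WASB).
(sales.write.mode("overwrite")
      .format("parquet")
      .saveAsTable("sales_parquet"))

# Subsequent interactive queries read local, columnar data instead of going over JDBC.
spark.sql("select region, sum(amount) from sales_parquet group by region").show()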
