Big data analysis on Amazon Aurora RDS - apache-spark

I have an Aurora table with 500 million records.
I need to perform big data analysis on it, such as finding the diff between two tables.
Until now I have been doing this with Hive on files in the file system, but we have since loaded all of those rows into the Aurora DB.
I still need to compute the same diff every month.
So what would be the best option for this?
Exporting the Aurora data back to S3 as files and then running the Hive query on them (how long might it take to export all the Aurora rows to S3)?
Can I run a Hive query against an Aurora table? (I assume Hive on Aurora is not supported.)
Running Spark SQL against Aurora (what would the performance be like)?
Or is there a better way to do this?

In my opinion, Aurora MySQL isn't a good option for big data analysis. That follows from the limitations of MySQL InnoDB, and from the additional restrictions Aurora places on top of MySQL InnoDB. For instance, you won't find features such as data compression or a columnar storage format there.
As for Aurora itself, you can use Aurora Parallel Query, for instance, but it doesn't support partitioned tables.
https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/
Another option is to connect directly to Aurora using AWS Glue and perform the analysis there, but in that case the database's performance can become the bottleneck.
https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html
I suggest exporting the data to S3 using LOAD DATA FROM S3 / SELECT INTO OUTFILE S3 and then performing the analysis with Glue or EMR. You should also consider using Redshift instead of Aurora.
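To make the export-then-diff route concrete, here is a minimal sketch. The SELECT INTO OUTFILE S3 statement follows the Aurora MySQL export syntax, and the Spark part assumes two monthly exports landing under separate S3 prefixes; bucket names, prefixes and dates are placeholders, not your actual setup.

import org.apache.spark.sql.SparkSession

// 1) Export from Aurora MySQL (run as SQL against the cluster), e.g.:
//      SELECT * FROM my_table
//      INTO OUTFILE S3 's3-us-east-1://my-export-bucket/aurora/my_table_2023_01'
//      FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
// 2) Diff the two monthly exports with Spark on EMR:
val spark = SparkSession.builder().appName("aurora-monthly-diff").getOrCreate()

val previous = spark.read.csv("s3://my-export-bucket/aurora/my_table_2023_01/")
val current  = spark.read.csv("s3://my-export-bucket/aurora/my_table_2023_02/")

// Rows added since last month and rows that disappeared
val added   = current.except(previous)
val removed = previous.except(current)

added.write.parquet("s3://my-export-bucket/diff/added/")
removed.write.parquet("s3://my-export-bucket/diff/removed/")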

Related

What is the fastest way to pull massive amounts of data from Snowflake Database into AWS SageMaker?

What would be the fastest way to pull very large datasets from Snowflake into my SageMaker instance in AWS? How does the Snowflake Python connector (what I currently use) compare to, let's say, a Spark connector to Snowflake?
SageMaker training jobs work best with S3 as the input source, but you can also use EFS (NFS) or FSx for Lustre for higher performance.
For S3, I'd use AWS Glue to read from Snowflake, or Spark on EMR, and store the data partitioned in S3 (see the sketch at the end of this answer). Partitioning lets you distribute your training across multiple machines, if your algorithm supports it.
There's also COPY INTO in Snowflake, which can unload data straight to S3.
Ideally you'd store the data in Parquet format, but [gzipped] CSV is the common format for SageMaker built-in algorithms. If you're using your own algorithm, then you should probably go with Parquet.
If you're doing forecasting, you could also use Amazon Forecast, but it can get pricey
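For the Glue/EMR route, a minimal sketch of pulling a table out of Snowflake with the Snowflake Spark connector and writing it to S3 as partitioned Parquet might look like the following; the connection options, table name, partition column and bucket are placeholders, and the connector JARs are assumed to be available on the cluster.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("snowflake-to-s3").getOrCreate()

// Connection options for the Snowflake Spark connector (all values are placeholders)
val sfOptions = Map(
  "sfURL"       -> "myaccount.snowflakecomputing.com",
  "sfUser"      -> "MY_USER",
  "sfPassword"  -> "MY_PASSWORD",
  "sfDatabase"  -> "MY_DB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "MY_WH"
)

val sales = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "SALES")   // hypothetical table to pull
  .load()

// A partitioned Parquet layout lets training (or further Spark jobs) read shards in parallel
sales.write
  .partitionBy("sale_date")
  .parquet("s3://my-training-bucket/sales/")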

Using Spark Connector for Databricks and Snowflake on AWS

I'm looking at using both Databricks and Snowflake, connected by the Spark Connector, all running on AWS. I'm struggling to understand the following before making a decision:
How well does the Spark Connector perform? (performance, extra costs, compatibility)
What comparisons can be made between Databricks SQL and Snowflake SQL in terms of performance and standards?
What have been the “gotchas” or unfortunate surprises about trying to use both?
Snowflake has invested in the Spark connector's performance, and according to benchmarks [0] it performs well.
The SQL dialects are similar. "Databricks SQL maintains compatibility with Apache Spark SQL semantics." [1] "Snowflake supports most of the commands and statements defined in SQL:1999." [2]
I haven't experienced gotchas. I would avoid using different regions. The performance characteristics of Databricks SQL have been different since 6/17, when they made their Photon engine the default.
As always, the utility will depend on your use case, for example:
If you were doing analytical Databricks SQL queries on partitioned, compressed Parquet in Delta Lake, then the performance ought to be roughly similar to Snowflake's -- but if you were doing analytical Databricks SQL queries against a JDBC MySQL connection, then Snowflake's performance should be vastly better.
If you were doing wide table-scan style queries (e.g. select * from foo, with no where and no limit) in Databricks SQL and then doing the analysis in a kernel (or something similar), then switching to Snowflake isn't going to do much for you (see the sketch below).
etc.
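To illustrate that last point, here is a minimal sketch, assuming an existing SparkSession named spark (as in a Databricks notebook) and the Snowflake Spark connector on the cluster, of the difference between pushing an aggregation down to Snowflake via the query option and pulling a whole table to analyse on the Spark side; the option map, table and column names are placeholders.

// Connection options for the Snowflake Spark connector (all values are placeholders)
val sfOptions = Map(
  "sfURL"       -> "myaccount.snowflakecomputing.com",
  "sfUser"      -> "MY_USER",
  "sfPassword"  -> "MY_PASSWORD",
  "sfDatabase"  -> "MY_DB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "MY_WH"
)

// Aggregation pushed down: Snowflake does the heavy lifting, Spark receives a small result
val dailyTotals = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("query", "select sale_date, sum(amount) as total from sales group by sale_date")
  .load()

// Wide scan: the whole table crosses the wire and the work happens on the Spark cluster,
// so the engine behind the connector matters much less
val everything = spark.read
  .format("net.snowflake.spark.snowflake")
  .options(sfOptions)
  .option("dbtable", "sales")
  .load()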
[0] - https://www.snowflake.com/blog/snowflake-connector-for-spark-version-2-6-turbocharges-reads-with-apache-arrow/
[1] - https://docs.databricks.com/sql/release-notes/index.html
[2] - https://docs.snowflake.com/en/sql-reference/intro-summary-sql.html

AWS Data Lake Ingest

Do you need to ingest Excel and other proprietary formats using Glue, or can you let Glue crawl your S3 bucket so these data formats can be used within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a Data Provider with a large set of data stored on their system as excel and access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I'm not seeing anything that can take Excel data directly into the data lake. You might need to convert it into CSV/TSV/JSON or another supported format before loading it into the data lake (see the sketch below).
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- again, I don't see Excel there as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel isn't supported here either.
You need to upload the files to S3 to use Athena or Redshift Spectrum, or even Redshift storage itself.
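One way to do that conversion is with Spark plus a third-party Excel connector such as spark-excel (an assumption on my part; any Excel-to-CSV step would do). A minimal sketch, where the bucket names and paths are placeholders and the option names may vary between connector versions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("excel-to-datalake").getOrCreate()

// Read the submitted workbook (requires the spark-excel package on the classpath)
val submissions = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")   // first row holds the column names
  .load("s3://my-submission-bucket/raw/provider_data.xlsx")

// Write an Athena/Spectrum-friendly copy into the curated area of the lake
submissions.write
  .mode("overwrite")
  .option("header", "true")
  .csv("s3://my-datalake-bucket/curated/provider_data/")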
Uploading Files to S3:
If you have bigger files, you need to use S3 multipart upload so they upload quicker. If you want even more speed, use S3 Transfer Acceleration when uploading your files.
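For the upload itself, the AWS SDK's TransferManager switches to multipart uploads automatically for large files. A minimal sketch using the Java SDK v1, with the bucket, key and local path as placeholders:

import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import java.io.File

// TransferManager uses multipart upload under the hood for large objects
val transferManager = TransferManagerBuilder.standard().build()

val upload = transferManager.upload(
  "my-submission-bucket",               // target bucket
  "raw/provider_data.xlsx",             // object key
  new File("/data/provider_data.xlsx")  // local file to upload
)

upload.waitForCompletion()   // block until the (multipart) upload finishes
transferManager.shutdownNow()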
Querying Big Data with Athena:
You can create external tables in Athena over S3 locations. Once you have created the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
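The DDL is Hive-style. Here is a minimal sketch with a hypothetical table, columns and bucket; Athena accepts essentially the same CREATE EXTERNAL TABLE statement, and the same definition can be issued from Spark SQL on EMR when it shares the Glue Data Catalog:

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS provider_data (
    provider_id   STRING,
    amount        DOUBLE,
    submitted_at  TIMESTAMP
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LOCATION 's3://my-datalake-bucket/curated/provider_data/'
  TBLPROPERTIES ('skip.header.line.count' = '1')
""")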
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results on Redshift.
Redshift works with a lot of commercial tools; I use SQL Workbench. It is free, open source and rock solid, and AWS documents how to connect it to Redshift.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to move the data into Redshift storage, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
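As a sketch of what issuing a COPY looks like from code (the Redshift JDBC driver is assumed to be on the classpath, and the cluster endpoint, credentials, table, bucket and IAM role are all placeholders):

import java.sql.DriverManager

// Placeholder Redshift cluster endpoint and credentials
val url  = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
val conn = DriverManager.getConnection(url, "my_user", "my_password")

val copySql =
  """COPY provider_data
    |FROM 's3://my-datalake-bucket/curated/provider_data/'
    |IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    |CSV
    |IGNOREHEADER 1""".stripMargin

val stmt = conn.createStatement()
try {
  stmt.execute(copySql)   // Redshift pulls the S3 files in parallel across its slices
} finally {
  stmt.close()
  conn.close()
}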
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, check the required size and number of nodes. More nodes means queries run with more parallelism. Another important factor is how well your data is distributed (distribution key and sort keys).
I have had a very good experience with Redshift; getting up to speed might take some time.
Hope it helps.

How to export a 2TB table from a RDS instance to S3 or Hive?

I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (csv file) or Hive.
The table has a total of 2TB of data. And it has a BLOB column which stores a zip file (usually 100KB, but it can reach 5MB).
I made some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example which does exactly that.
Once you have taken the dump to S3 in CSV format, it is easy to read the data with Spark and register it as a Hive table.
val df = spark.read.csv("s3://...")
df.write.saveAsTable("mytable") // saves it as a Hive table (requires Hive support on the SparkSession)

Execute query on Spark vs Redshift

Our datawarehouse is in Redshift (50TB size). Sometimes business users run big queries (too many joins, inline queries - generated by BI tools such as Tableau). Big queries slow database performance.
Is it wise to use Spark on top of Redshift to offload some of the computation outside Redshift?
Or will it be easier and cost effective to increase Redshift computation power by adding more nodes?
If I execute select a.col1, b.col2 from table1 a, table2 b where a.key = b.key in Spark, with the tables connected via JDBC and residing on Redshift, where does the actual processing happen (in Spark or in Redshift)?
Any queries on the data stored in Amazon Redshift are performed by the Amazon Redshift nodes. While Spark could make an external JDBC call, the SQL will be executed by Redshift.
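One way to make sure the join in the question runs on Redshift rather than in Spark is to wrap the whole query in the dbtable option of Spark's generic JDBC source, so Redshift executes it and Spark only receives the result set. A minimal sketch, assuming an existing SparkSession named spark, the Redshift JDBC driver on the classpath, and placeholder endpoint, credentials and table names:

// Push the whole join down to Redshift by passing it as a subquery in "dbtable"
val pushedDown = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
  .option("driver", "com.amazon.redshift.jdbc42.Driver")
  .option("user", "my_user")
  .option("password", "my_password")
  .option("dbtable",
    "(select a.col1, b.col2 from table1 a join table2 b on a.key = b.key) as pushed")
  .load()

// Redshift runs the join; Spark only materializes the resulting rows
pushedDown.show()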
There are many techniques to optimize Redshift query execution:
Tuning Query Performance
Top 10 Performance Tuning Techniques for Amazon Redshift
Tuning Workload Management parameters to control parallel queries and memory allocation
Start by looking at queries that consume too many resources and determine whether they can be optimized by changing the Sort Key, Distribution Key and Compression Encodings used by each table. Correct use of these parameters can greatly improve Redshift performance.
Then, if many users are running simultaneous queries, check whether it is worth improving Workload Management settings to create separate queues with different memory settings.
Finally, if performance is still a problem, add more Redshift nodes. Dense compute nodes offer better performance because they use SSD storage, but at a higher cost per TB of storage.
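For reference, the table-level knobs mentioned above (sort key, distribution key, compression encodings) are declared in the table DDL. A hedged sketch with a hypothetical table, held here as a string so it can be run from any SQL client or over JDBC:

// Placeholder DDL showing where the distribution key, sort key and encodings are set
val createSalesTable =
  """CREATE TABLE sales (
    |  sale_id     BIGINT        ENCODE az64,
    |  customer_id BIGINT,
    |  sale_date   DATE,
    |  amount      DECIMAL(12,2) ENCODE az64
    |)
    |DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id
    |SORTKEY (sale_date)     -- speeds up range-restricted scans on sale_date
    |""".stripMargin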
