According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether it is also possible outside of the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake related metadata using Glue crawlers?
This is not possible. You can crawl the S3 Delta files outside the Databricks platform, but you won't find the data in the tables.
As per the docs:
Warning
Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
It is finally possible to use AWS Glue Crawlers to detect and catalog Delta Tables.
Here is a blog post explaining how to do it.
I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).
I generate a manifest file for each Delta Table using:
from delta.tables import DeltaTable

# requires an active SparkSession named `spark`
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
Then I created the table using the example below. The DDL also creates the table inside the Glue Catalog, so you can then access the data from AWS Glue via the Glue Data Catalog.
CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest
It would be better if you could clarify what you mean by "integrate Delta Lake with AWS Glue".
At this moment, there is no direct Glue API for Delta Lake support; however, you could write customized code using the Delta Lake library to save the output as a Delta Lake table.
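For example, a Glue job that has the Delta Lake JAR attached and the Delta SQL extension configured could write its output roughly like this (a minimal PySpark sketch; the bucket paths and input data are hypothetical):

from pyspark.sql import SparkSession

# Enable the Delta Lake extension and catalog on the job's SparkSession
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/input/")                 # hypothetical input
df.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/my_table/")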
To use a Crawler to add Delta Lake metadata to the Catalog, here is a workaround. The workaround is not pretty and has two major parts.
1) Get the manifest of files referenced by the Delta Lake table. You could refer to the Delta Lake source code, inspect the logs in _delta_log, or use a brute-force method such as
import org.apache.spark.sql.functions.input_file_name

spark.read.format("delta")
  .load("<path-to-delta-lake>")
  .select(input_file_name())
  .distinct()
2) Use the Scala or Python Glue API and the manifest to create or update the table in the Catalog, as sketched below.
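A minimal sketch of part 2 with the Python (boto3) Glue API could look like this; the database, table, columns, and S3 path are hypothetical, and the symlink input format matches the DDL shown earlier:

import boto3

glue = boto3.client("glue")
glue.create_table(
    DatabaseName="my_database",                                   # hypothetical
    TableInput={
        "Name": "mytable",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "col_name1", "Type": "string"}],
            "Location": "s3://my-bucket/delta-table/_symlink_format_manifest/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)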
AWS Glue Crawler can now update the Glue metastore with metadata from Delta table transaction logs.
Ref - https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake
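For example, with the boto3 Glue API a Delta crawler can be defined roughly like this (a sketch under the assumption that your boto3 release exposes DeltaTargets as described in the linked docs; the names, role, and paths are hypothetical):

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="delta-lake-crawler",                                    # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",        # hypothetical role
    DatabaseName="my_database",
    Targets={
        "DeltaTargets": [{
            "DeltaTables": ["s3://my-bucket/delta/my_table/"],
            "WriteManifest": True,        # produces the symlink manifest table
        }]
    },
)
glue.start_crawler(Name="delta-lake-crawler")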
But there are a few downsides to it:
- It creates a symlink table in the Glue metastore.
- This symlink-based approach doesn't work well when there are multiple versions of the table, since the manifest file only points to the latest version.
- There is no identifier in the Glue metadata to tell whether a given table is a Delta table, in case you have different types of tables in your metastore.
- Any execution engine that accesses the Delta table via manifest files won't utilize the other auxiliary data in the transaction logs, such as column stats.
Yes, it is possible, but only recently.
See the AWS blog entry below for details on this just-announced capability.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
I'd like to use Data Fusion on GCP as my ETL pipeline manager and store the raw data in GCS using the Delta format. Has anyone done this, or does a plugin exist?
Data Fusion has a plugin to read files/objects from a path in a Google Cloud Storage bucket, and it does support the Parquet format. One approach could be to use a Cloud Function to convert the Delta data to Parquet and then use it in the Data Fusion pipeline.
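A rough sketch of that conversion step, assuming the deltalake (delta-rs) package with GCS support and the google-cloud-storage client are available in the function; bucket and object names are made up:

import pyarrow.parquet as pq
from deltalake import DeltaTable
from google.cloud import storage

def delta_to_parquet(event, context):
    # Read the current snapshot of the Delta table without Spark
    dt = DeltaTable("gs://my-raw-bucket/events")                  # hypothetical path
    pq.write_table(dt.to_pyarrow_table(), "/tmp/events.parquet")

    # Stage the Parquet file where the Data Fusion pipeline can pick it up
    blob = storage.Client().bucket("my-staging-bucket").blob("events/events.parquet")
    blob.upload_from_filename("/tmp/events.parquet")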
I have been exploring the data lakehouse concept and Delta Lake. Some of its features seem really interesting. Right there on the project home page https://delta.io/ there is a diagram showing Delta Lake running on "your existing data lake" without any mention of Spark. Elsewhere it suggests that Delta Lake indeed runs on top of Spark. So my question is: can it be run independently of Spark? Can I, for example, set up Delta Lake with S3 buckets for storage in Parquet format, schema validation, etc., without using Spark in my architecture?
You might keep an eye on this: https://github.com/delta-io/delta-rs
It's early and currently read-only, but worth watching as the project evolves.
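With more recent releases of the Python bindings, a Spark-free read looks roughly like this (a sketch; the table path is hypothetical):

from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/my-delta-table")                  # hypothetical path
print(dt.files())        # Parquet files in the current snapshot
df = dt.to_pandas()      # read the table into pandas, no Spark involved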
tl;dr No
Delta Lake up to and including 0.8.0 is tightly integrated with Apache Spark, so it's impossible to have Delta Lake without Spark.
Do you need to ingest Excel and other proprietary formats using Glue, or to allow Glue to crawl your S3 bucket, in order to use these data formats within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data that is submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I don't see a way to take Excel data directly into the data lake. You might need to convert it into CSV/TSV/JSON or another supported format before loading it into the data lake.
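As a rough sketch of that conversion, pandas (with an Excel engine such as openpyxl) plus boto3 could do it; file and bucket names are hypothetical:

import boto3
import pandas as pd

# Convert the Excel workbook to CSV
df = pd.read_excel("provider_data.xlsx", sheet_name=0)           # hypothetical source file
df.to_csv("/tmp/provider_data.csv", index=False)

# Upload the CSV to the submission bucket
s3 = boto3.client("s3")
s3.upload_file("/tmp/provider_data.csv", "my-submission-bucket", "landing/provider_data.csv")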
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- again, I don't see Excel listed as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not supported here either.
You need to upload the files to S3 whether you use Athena, Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, use S3 multipart upload to upload them quicker. If you want more speed, use S3 Transfer Acceleration to upload your files.
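With boto3, multipart upload is handled for you once the file crosses a size threshold; a sketch with hypothetical file and bucket names:

import boto3
from boto3.s3.transfer import TransferConfig

# Files above the threshold are uploaded in parallel 64 MB parts
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=8)

s3 = boto3.client("s3")
s3.upload_file("/data/big_extract.csv", "my-submission-bucket",
               "landing/big_extract.csv", Config=config)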
Querying Big Data with Athena:
You can create external tables with Athena from S3 locations. Once you create the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
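For example, with boto3 you can run a query against an external table like this (database, table, and result location are hypothetical):

import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT col1, COUNT(*) FROM my_db.provider_data GROUP BY col1",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])   # poll get_query_execution for status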
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results in Redshift.
Redshift has a lot of commercial tools; I use SQL Workbench. It is free, open source, rock solid, and supported by AWS.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to take the data into Redshift storage, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, check the required size and number of nodes. More nodes lets queries run in parallel. Another important factor is how well your data is distributed (distribution key and sort keys).
I have very good experience with Redshift; getting up to speed might take some time.
Hope it helps.
I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (csv file) or Hive.
The table has a total of 2 TB of data, and it has a BLOB column that stores a zip file (usually 100 KB, but it can reach 5 MB).
I made some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example that does that very thing.
Once you have taken the dump to S3 in CSV format, it is easy to read the data using Spark and register it as a Hive table.
val df = spark.read.csv("s3://...")
df.write.saveAsTable("mytable")  // saves as a Hive table in the metastore
I have a question regarding best practices for managing permanent tables in Spark. I have been working previously with Databricks, and in that context, Databricks manages permanent tables so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session, a permanent table is created with the saveAsTable command, using the option to partition the table. The data is stored in S3 as Parquet files.
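Concretely, the write looks roughly like this (the column name and path are made up):

(df.write
   .format("parquet")
   .partitionBy("event_date")                            # hypothetical partition column
   .option("path", "s3://my-bucket/warehouse/events")    # external table location
   .saveAsTable("events"))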
The next day, a new cluster is created and needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Maybe there is a way to store Hive metastore settings to be reused between Spark sessions? Or should I, each time a Spark cluster is created, do CREATE EXTERNAL TABLE with the correct options to specify the format (Parquet), the partitioning, and the path?
Furthermore, if I want to access those Parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
Thanks