AWS Data Lake Ingest - Excel

Do you need to ingest Excel and other proprietary formats using Glue, or let Glue crawl your S3 bucket, in order to use these data formats within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a Data Provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido

I'm not seeing anything that can take Excel data directly into the data lake. You will need to convert it into CSV/TSV/JSON or another supported format before loading it into the data lake.
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- Again I don't see Excel as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not supported here either.
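For the conversion step itself, here is a minimal sketch using pandas (the file names, sheet handling, and output paths are made up for illustration; any equivalent tool works):
import pandas as pd

# Read every worksheet from the submitted workbook (requires the openpyxl package).
sheets = pd.read_excel("submissions/provider_data.xlsx", sheet_name=None)

for name, df in sheets.items():
    # CSV is the simplest target; Parquet is usually a better fit for Athena/Spectrum.
    df.to_csv(f"converted/{name}.csv", index=False)
    # df.to_parquet(f"converted/{name}.parquet", index=False)
You could run the same conversion in a Glue Python shell job or a Lambda triggered by uploads to the submission bucket, so the missing ETL step happens automatically.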
Either way, you need to upload the files to S3, whether you use Athena, Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, use S3 multipart upload to upload them quicker. If you want even more speed, use S3 Transfer Acceleration to upload your files.
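As a rough sketch with boto3 (bucket and file names are placeholders), multipart upload is handled for you once a file crosses the configured threshold:
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above ~100 MB are split into 64 MB parts and uploaded in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file("converted/provider_data.csv",
               "my-submission-bucket",   # placeholder bucket name
               "raw/provider_data.csv",
               Config=config)
If Transfer Acceleration is enabled on the bucket, pointing the client at the accelerate endpoint (a botocore Config with s3={"use_accelerate_endpoint": True}) gives you the extra speed.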
Querying Big Data with Athena:
You can create external tables with Athena over S3 locations. Once you create the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
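A minimal sketch of that flow with boto3 and a CSV layout (the database, table, columns, and S3 paths below are hypothetical):
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS datalake.submissions (
    id     bigint,
    name   string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-lake/curated/submissions/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# Athena writes query output (including DDL results) to the location you give it.
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-data-lake/athena-results/"},
)
SELECT queries can be submitted the same way, or simply run from the Athena console.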
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift Spectrum, then query those tables and get the results in Redshift.
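Roughly, using psycopg2 against the cluster (the host, role ARN, schema, and table names are placeholders), you map the data catalog into Redshift as an external schema and query it in place:
import psycopg2

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True  # external DDL should run outside an explicit transaction
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'datalake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

cur.execute("SELECT count(*) FROM spectrum.submissions")
print(cur.fetchone())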
Redshift works with a lot of commercial and free client tools; I use SQL Workbench/J. It is free, open source, and rock solid, and it is the client AWS documents for connecting to Redshift.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to store the data in Redshift itself, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
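As a sketch (the table, bucket, and IAM role are placeholders), a COPY run through any SQL client or driver looks like this:
import psycopg2

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True
conn.cursor().execute("""
    COPY analytics.submissions
    FROM 's3://my-data-lake/curated/submissions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV
    IGNOREHEADER 1
""")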
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, work out the required size and number of nodes. More nodes let queries run with more parallelism. Another important factor is how well your data is distributed (distribution key and sort keys).
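For illustration, a hypothetical table DDL showing where those keys go (the column choices depend entirely on your join and filter patterns):
import psycopg2

conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True
conn.cursor().execute("""
    CREATE TABLE sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        sale_date   DATE,
        amount      DECIMAL(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)   -- co-locates rows that join on customer_id
    SORTKEY (sale_date)     -- speeds up range filters on sale_date
""")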
I have had a very good experience with Redshift; getting up to speed might take some time.
Hope it helps.

Related

Ingest JDBC/ODBC data to Snowflake

Does Snowflake support JDBC data sources, and if so, how? I'm using NetSuite Analytics as a data source and would like to load that into a Snowflake warehouse. The examples I'm finding for Snowflake are file readers; I realise I can convert my NetSuite data to a file and then ingest that, but I'd rather remove that additional step.
Snowflake has both ODBC and JDBC drivers that you can use. However, if you are loading a lot of data from Netsuite Analytics, most of the Snowflake drivers will actually generate files, PUT them to S3, and execute a COPY INTO statement to get the data into Snowflake for you. While it is more seamless, it is still executing that "additional step". The reason is...that's the most efficient way to get data into Snowflake, and it's not even close.
https://docs.snowflake.com/en/user-guide/odbc.html
https://docs.snowflake.com/en/user-guide/jdbc.html
No, Snowflake doesn't offer tools for loading data from JDBC or ODBC data sources. This is because Snowflake is a database platform and the functionality you're describing is that of a data integration or ETL tool. There are plenty of third party tools available that can handle this such as Matillion or Talend. Snowflake has a list of recommended technology partners on their website.
If you don't have access to an ETL tool then, as you mentioned, you can create a process yourself to export data from NetSuite to files that are uploaded to cloud storage such as AWS S3. You can then set up this storage area as an "external stage" and use Snowflake's COPY INTO statement to load the data into Snowflake.
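A rough sketch of that stage-and-COPY flow using the Snowflake Python connector (the stage name, table, bucket, and credentials are placeholders; a storage integration is the cleaner way to grant S3 access):
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                    warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC")
cur = conn.cursor()

# One-time setup: point an external stage at the S3 prefix holding the exported files.
cur.execute("""
    CREATE STAGE IF NOT EXISTS netsuite_stage
    URL = 's3://my-bucket/netsuite/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# Bulk load the staged files into the target table.
cur.execute("""
    COPY INTO netsuite_orders
    FROM @netsuite_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")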

Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure if it is possible to do it outside of the Databricks platform as well. Has someone done that? Also, is it possible to add Delta Lake-related metadata using Glue crawlers?
This is not possible. Although you can crawl the S3 Delta files outside the Databricks platform, you won't find the data in the tables.
The docs say the following:
Warning
Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
It is finally possible to use AWS Glue Crawlers to detect and catalog Delta Tables.
Here is a blog post explaining how to do it.
I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).
I generate a manifest file for each Delta Table using:
from delta.tables import DeltaTable  # assumes an active SparkSession named spark
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")
Then create the table using the example below. The DDL also registers the table in the Glue Data Catalog, so you can then access the data from AWS Glue via the Glue Data Catalog.
CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/' -- location of the generated manifest
It would be better if you could clarify what you mean by "integrate Delta Lake with AWS Glue".
At this moment, there is no direct Glue API for Delta Lake support; however, you could write custom code using the Delta Lake library to save output as a Delta Lake table.
To use a crawler to add Delta Lake metadata to the Catalog, here is a workaround. The workaround is not pretty and has two major parts.
1) Get the manifest of files referenced by the Delta Lake table. You could refer to the Delta Lake source code, play with the logs in _delta_log, or use a brute-force method such as
import org.apache.spark.sql.functions.input_file_name

spark.read.format("delta")
  .load("<path-to-delta-lake>")
  .select(input_file_name())
  .distinct()
2) Use the Scala or Python Glue API and the manifest to create or update the table in the Catalog, for example as sketched below.
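For step 2, a sketch with boto3 (the database, table name, columns, and location are placeholders) that registers the manifest location as a symlink table in the Catalog:
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_delta_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "id", "Type": "bigint"}],  # your real schema here
            "Location": "s3://my-bucket/delta/my_table/_symlink_format_manifest/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)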
AWS Glue Crawler allows us to update metadata from Delta table transaction logs into the Glue metastore.
Ref - https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake
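A minimal sketch of configuring such a crawler with boto3 (the role, database, and S3 path are placeholders):
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="delta-table-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={
        "DeltaTargets": [{
            "DeltaTables": ["s3://my-bucket/delta/my_table/"],
            "WriteManifest": True,   # produces the symlink manifest discussed below
        }]
    },
)
glue.start_crawler(Name="delta-table-crawler")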
But there are a few downsides to it -
It creates a symlink table in the Glue metastore.
This symlink-based approach wouldn't work well with multiple versions of the table, since the manifest file would point only to the latest version.
There is no identifier in the Glue metadata to tell whether a given table is a Delta table, in case you have different types of tables in your metastore.
Any execution engine that accesses the Delta table via manifest files cannot make use of other auxiliary data in the transaction logs, such as column stats.
Yes, it is possible, but only recently.
See the AWS blog entry below for details on this newly announced capability.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/

Big data analysis on Amazon Aurora RDS

I have an Aurora table that has 500 million records.
I need to perform big data analysis, like finding the diff between two tables.
Until now I have been doing this using Hive on the file system, but now we have inserted all the file rows into the Aurora DB.
Still, every month I need to do the same thing: find the diff.
So what would be the best option for this?
Exporting Aurora data back to S3 as files and then running the Hive query on that (how much time might it take to export all Aurora rows into S3)?
Can I run a Hive query on an Aurora table? (I guess Hive on Aurora is not supported.)
Running Spark SQL on Aurora (how would the performance be)?
Or is there any better way to do this?
In my opinion, Aurora MySQL isn't a good option for big data analysis. This stems from the limitations of MySQL InnoDB and from additional restrictions that Aurora places on top of MySQL InnoDB. For instance, you won't find features such as data compression or a columnar format.
When it comes to Aurora, you can use Aurora Parallel Query, for instance, but it doesn't support partitioned tables.
https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/
Another option is to connect directly to Aurora using AWS Glue and perform the analysis there, but in that case the database performance can become a bottleneck.
https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html
I suggest exporting the data to S3 using SELECT INTO OUTFILE S3 (and LOAD DATA FROM S3 for the reverse direction) and then performing the analysis with Glue or EMR. You should also consider using Redshift instead of Aurora.
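A rough sketch of the export side with PyMySQL (the table, bucket, and connection details are placeholders; the cluster needs an IAM role with S3 write access attached, and the exact OUTFILE S3 options vary by Aurora MySQL version, so check the docs):
import pymysql

conn = pymysql.connect(host="my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com",
                       user="admin", password="...", database="analytics")

with conn.cursor() as cur:
    # Exports the table as delimited files (plus a manifest) under the given S3 prefix.
    cur.execute("""
        SELECT * FROM big_table
        INTO OUTFILE S3 's3://my-bucket/exports/big_table'
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        MANIFEST ON
        OVERWRITE ON
    """)
The resulting files (or the manifest) can then be crawled by Glue or read directly by EMR/Spark.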

How to export a 2TB table from an RDS instance to S3 or Hive?

I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (CSV file) or Hive.
The table has a total of 2TB of data. And it has a BLOB column which stores a zip file (usually 100KB, but it can reach 5MB).
I ran some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example that does exactly that.
Once you have taken the dump to S3 in CSV format, it is easy to read the data using Spark and register it as a Hive table.
val df = spark.read.csv("s3://...")
df.write.saveAsTable("mytable") // saves it as a Hive table

For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner or later I will be using a cloud-based Hadoop solution, because my Hadoop cluster is very small compared to the real workload; however, I haven't decided yet which one I will be using, i.e. Windows Azure based, EMR, or something else. I am generating lots of data locally and want to store it in some cloud-based storage, given that I will use this data with Hadoop later, but fairly soon.
I am looking for suggestions to decide which cloud store to choose based on someone's experience. Thanks in advance.
First of all, it is a great question. Let's try to understand how data is processed in Hadoop:
In Hadoop, all the data is processed on the Hadoop cluster; that means when you process any data, it is copied from its source to HDFS, which is an essential component of Hadoop.
Only after the data is copied to HDFS do you run Map/Reduce jobs on it to get your results.
That means it does not matter what or where your data source is (Amazon S3, Azure Blob, SQL Azure, SQL Server, an on-premise source, etc.); you will have to move/transfer/copy your data from the source to HDFS, within the limits of Hadoop.
Once the data is processed on the Hadoop cluster, the result is stored in the location you configured in your job. The output destination can be HDFS or an outside location accessible from the Hadoop cluster.
Once you have data copied to HDFS you can keep it on HDFS as long as you want, but you will have to pay the price of keeping the Hadoop cluster running.
In some cases, when you run Hadoop jobs at intervals and the data move/copy can be done quickly, it is good to have a strategy to 1) acquire the Hadoop cluster, 2) copy the data, 3) run the job, 4) release the cluster.
So, based on the above, when you choose a cloud data store for your Hadoop cluster you have to consider the following:
If you have large data to process (which is normal with Hadoop clusters), consider the different data stores and the time it will take to copy/move data from them to HDFS, because this will be your first step.
You need to choose a data store with the lowest network latency so you can get data in and out as fast as possible.
You also need to consider how you will move a large amount of data from your current location to any cloud store. The best option would be a storage service that lets you ship your data on disk (HDD/tape, etc.), because uploading multiple TB of data will take a great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provide pre-configured Hadoop clusters in the cloud, so you can first choose where you want to run your Hadoop job and then consider the cloud storage.
Even if you choose one cloud data store and later decide to move to another because you want to use a different cloud Hadoop cluster, you certainly can transfer the data; however, consider the time and the data-transfer support available to you.
For example, with HadoopOnAzure you can connect various data sources, i.e. Amazon S3, Azure Blob Storage, SQL Server, SQL Azure, etc., so support for a variety of data sources is a plus with any cloud Hadoop cluster.
