Is there an efficient way to copy a table from Redshift to Postgres using Node.js? I couldn't find any concrete examples.
There does not seem to be any pre-written utility. For anything more than a few rows, the process you need to set up is:
Push data to S3
Use the Redshift COPY command (issued via the SDK) to copy from S3 to Redshift
Transform data in Redshift (optional)
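The COPY step above can be sketched roughly as follows. This is only a minimal sketch: the helper name `build_copy_sql`, the table name, the S3 prefix and the IAM role ARN are all made up for illustration, and it assumes the files in S3 are CSV with a header row. It only builds the statement; you would still run it over whatever SQL connection you use for Redshift.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role_arn: str, ignore_header: bool = True) -> str:
    """Build a Redshift COPY statement that loads CSV files from an S3 prefix.

    The arguments are interpolated as-is, so only pass trusted values.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS CSV",
    ]
    if ignore_header:
        parts.append("IGNOREHEADER 1")  # skip the header row of each file
    return "\n".join(parts) + ";"


# Hypothetical values, for illustration only.
copy_sql = build_copy_sql(
    table="public.events",
    s3_uri="s3://my-bucket/exports/events/",
    iam_role_arn="arn:aws:iam::123456789012:role/MyRedshiftCopyRole",
)
print(copy_sql)
```

The role named in IAM_ROLE has to be attached to the cluster and allowed to read the bucket, otherwise the COPY fails with a permissions error.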
I have one testdata.dmp available in an AWS S3 bucket and want to load the data into a pandas DataFrame. I'm looking for a solution; I have boto3 installed.
Your Oracle dump file testdata.dmp has a proprietary binary format maintained by Oracle. This means that Oracle controls which tools can process it correctly. One such tool is Oracle Data Pump.
A workflow to extract data from an Oracle dump file and write it as Parquet files (readable with pandas) could look as follows:
Create an Oracle DB. As you are already using AWS S3, I suggest setting up an AWS RDS instance with Oracle engine.
Download testdata.dmp from S3 to the created Oracle DB. This can be done by RDS' S3 integration.
Run Oracle Data Pump Import on the RDS instance. This tool is installed by default. The RDS docs provide a detailed walk-through. Now the content of testdata.dmp lives as tables with data and other objects inside the Oracle DB.
Dump all tables (and other objects) with a tool that is able to query Oracle DBs and able to write the result as Parquet. Some choices:
Sqoop (Hadoop-based command line tool, but deprecated)
(Py)Spark (Popular data processing tool and imho the unofficial successor of Sqoop.)
python-oracledb + Pandas
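For the last option (python-oracledb + pandas), the dump step could look roughly like the sketch below. It assumes the Data Pump import has already finished, that pandas plus a Parquet engine (pyarrow or fastparquet) are installed, and that the table names are plain unquoted identifiers. The function names, credentials, DSN and table names are all hypothetical.

```python
import os
import re
from pathlib import Path


def is_plain_identifier(name: str) -> bool:
    """True if name looks like a simple unquoted Oracle identifier."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_$#]*", name) is not None


def select_all_sql(table: str) -> str:
    """Return a SELECT for one table, rejecting anything that is not a plain identifier."""
    if not is_plain_identifier(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT * FROM {table}"


def dump_tables_to_parquet(conn, tables, out_dir):
    """Write each table to <out_dir>/<table>.parquet and return the written paths.

    conn is an open DB-API connection, e.g. one returned by oracledb.connect(...).
    """
    import pandas as pd  # imported lazily; to_parquet also needs pyarrow or fastparquet

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for table in tables:
        df = pd.read_sql(select_all_sql(table), conn)  # loads the whole table into memory
        path = out_dir / f"{table.lower()}.parquet"
        df.to_parquet(path, index=False)
        written.append(path)
    return written


def main():
    import oracledb  # the python-oracledb driver

    # Hypothetical user, DSN and table names; the password comes from the environment.
    conn = oracledb.connect(
        user="admin",
        password=os.environ["ORACLE_PASSWORD"],
        dsn="my-rds-host:1521/ORCL",
    )
    try:
        print(dump_tables_to_parquet(conn, ["EMPLOYEES", "DEPARTMENTS"], "parquet_out"))
    finally:
        conn.close()
```

Calling `main()` runs the export. Each table is read fully into memory, so for large tables you would want chunked reads or Spark instead.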
I connected via SSH to Dev Endpoint in Glue.
There is Spark 2.4.1 running.
I want to run a simple query select * from pg_namespace;
After that, I also want to move data from S3 to Redshift using the COPY command.
How to write that in a Spark console?
Thanks.
I'm not sure whether you can use the COPY command directly; I haven't tried it.
For moving data from S3 to Redshift, you can use the AWS Glue APIs; AWS provides sample code for this. Behind the scenes, I think AWS Glue uses COPY / UNLOAD commands to move data between S3 and Redshift.
You can use the AWS CLI and psql from your SSH terminal.
For psql check https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html
Then you can run SELECT and COPY commands from it.
But I would not recommend this: AWS Glue is a serverless service, so your cluster will be different every time.
Do you need to ingest Excel and other proprietary formats using Glue, or can Glue crawl your S3 bucket so that these data formats can be used within your data lake?
I have gone through the "Data Lake Foundation on the AWS Cloud" document and am left scratching my head about getting data into the lake. I have a data provider with a large set of data stored on their system as Excel and Access files.
Based on the process flow, they would upload the data into the submission S3 bucket, which would set off a series of actions, but there is no ETL of the data into a format that would work with the other tools.
Would using these files require running Glue on the data submitted to the bucket, or is there another way to make this data available to other tools such as Athena and Redshift Spectrum?
Thank you for any light you can shed on this topic.
-Guido
I don't see anything that can take Excel data directly into the data lake. You might need to convert it to CSV, TSV, JSON or another supported format before loading it.
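One possible way to do that conversion is sketched below, assuming pandas (plus openpyxl for .xlsx files) is available. The function names and the output naming scheme are made up for illustration; each sheet simply becomes its own CSV file next to the workbook.

```python
from pathlib import Path


def csv_path_for(excel_path: str, sheet_name: str) -> Path:
    """Derive an output path such as report_Q1_Sales.csv next to the workbook."""
    source = Path(excel_path)
    safe_sheet = "".join(c if c.isalnum() else "_" for c in sheet_name)
    return source.with_name(f"{source.stem}_{safe_sheet}.csv")


def excel_to_csv(excel_path: str) -> list:
    """Write every sheet of a workbook to its own CSV file and return the paths."""
    import pandas as pd  # imported lazily; reading .xlsx also needs openpyxl

    # sheet_name=None returns a dict mapping sheet name to DataFrame.
    sheets = pd.read_excel(excel_path, sheet_name=None)
    written = []
    for sheet_name, df in sheets.items():
        out_path = csv_path_for(excel_path, sheet_name)
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

The resulting CSV files can then be uploaded to S3 like any other file. Access files would need a different export route; this sketch only covers Excel.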
Formats Supported by Redshift Spectrum:
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html -- Again I don't see Excel as of now.
Athena Supported File Formats:
http://docs.aws.amazon.com/athena/latest/ug/supported-formats.html -- Excel is not listed here either.
You need to upload the files to S3 to use Athena, Redshift Spectrum, or even Redshift storage itself.
Uploading Files to S3:
If you have bigger files, use S3 multipart upload to upload them more quickly. If you want more speed, use S3 Transfer Acceleration for the uploads.
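With boto3, a multipart upload could be set up roughly like this. It is only a sketch: the threshold, chunk size and concurrency are arbitrary example values, the function names are mine, and `accelerate=True` assumes Transfer Acceleration has already been enabled on the bucket.

```python
import math

MB = 1024 * 1024


def part_count(file_size: int, chunk_size: int = 8 * MB) -> int:
    """Number of parts a multipart upload would use for a file of file_size bytes."""
    return max(1, math.ceil(file_size / chunk_size))


def upload_large_file(path: str, bucket: str, key: str, accelerate: bool = False) -> None:
    """Upload a file with boto3's managed transfer, which goes multipart above the threshold."""
    import boto3  # imported lazily so part_count works without boto3 installed
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    # Route requests through the accelerate endpoint only when asked to.
    client_config = Config(s3={"use_accelerate_endpoint": True}) if accelerate else None
    s3 = boto3.client("s3", config=client_config)

    transfer_config = TransferConfig(
        multipart_threshold=64 * MB,  # files larger than this are split into parts
        multipart_chunksize=8 * MB,   # size of each part
        max_concurrency=10,           # parts uploaded in parallel
    )
    s3.upload_file(path, bucket, key, Config=transfer_config)
```

For example, `part_count(100 * MB)` gives 13 parts at the 8 MB chunk size used here. S3 caps a multipart upload at 10,000 parts, so very large files need a bigger chunk size.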
Querying Big Data with Athena:
You can create external tables with Athena from S3 locations. Once you have created the external tables, use the Athena SQL reference to query your data.
http://docs.aws.amazon.com/athena/latest/ug/language-reference.html
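To give an idea of what such an external table definition looks like, here is a small sketch that builds the DDL as a string. The database, table, column names and bucket are hypothetical, and it assumes plain comma-delimited text files with one header row; other formats need a different ROW FORMAT / SerDe clause.

```python
def build_external_table_ddl(database, table, columns, s3_location):
    """Build an Athena CREATE EXTERNAL TABLE statement for comma-delimited text files.

    columns is a list of (name, type) pairs; values are interpolated as-is.
    """
    if not s3_location.endswith("/"):
        s3_location += "/"  # LOCATION should point at a folder-style prefix, not a single file
    column_lines = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical names, for illustration only.
ddl = build_external_table_ddl(
    database="lake",
    table="submissions",
    columns=[("id", "int"), ("name", "string")],
    s3_location="s3://my-bucket/submissions",
)
print(ddl)
```

You can paste the printed statement into the Athena query editor; after that the table can be queried with ordinary SELECT statements.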
Querying Big Data with Redshift Spectrum:
Similar to Athena, you can create external tables with Redshift. Start querying those tables and get the results on Redshift.
Many commercial tools work with Redshift; I use SQL Workbench/J. It is free, open source and rock solid, and it is supported by AWS.
SQL WorkBench: http://www.sql-workbench.net/
Connecting your WorkBench to Redshift: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-using-workbench.html
Copying data to Redshift:
Also, if you want to move the data storage to Redshift, you can use the COPY command to pull the data from S3 and load it into Redshift.
Copy Command Examples:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
Redshift Cluster Size and Number of Nodes:
Before creating a Redshift cluster, work out the size and number of nodes you need. More nodes let queries run with more parallelism. Another important factor is how well your data is distributed (distribution key and sort keys).
I have had a very good experience with Redshift, though getting up to speed might take some time.
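As a rough illustration of where the distribution key and sort keys go, the sketch below builds a CREATE TABLE statement. The table and column names are invented, and which columns make good keys depends entirely on your own joins and filters.

```python
def build_create_table_sql(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement with a distribution key and sort keys.

    columns is a list of (name, type) pairs; dist_key and sort_keys must name those columns.
    """
    column_names = {name for name, _ in columns}
    missing = ({dist_key} | set(sort_keys)) - column_names
    if missing:
        raise ValueError(f"key columns not in table: {sorted(missing)}")
    column_lines = ",\n".join(f"  {name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical table, for illustration only.
create_sql = build_create_table_sql(
    table="sales",
    columns=[("sale_id", "BIGINT"), ("customer_id", "BIGINT"), ("sold_at", "TIMESTAMP")],
    dist_key="customer_id",
    sort_keys=["sold_at"],
)
print(create_sql)
```

Rows with the same distribution key value land on the same slice, so a commonly joined column is a typical choice, while sort keys usually follow the columns you filter on by range.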
Hope it helps.
I am trying to migrate an entire table from my RDS instance (MySQL 5.7) to either S3 (csv file) or Hive.
The table has a total of 2TB of data. And it has a BLOB column which stores a zip file (usually 100KB, but it can reach 5MB).
I made some tests with Spark, Sqoop and AWS DMS, but had problems with all of them. I have no experience exporting data from RDS with those tools, so I really appreciate any help.
Which one is the most recommended for this task? And what strategy do you think is more efficient?
You can copy the RDS data to S3 using AWS Data Pipeline. Here is an example which does exactly that.
Once you have taken the dump to S3 in CSV format, it is easy to read the data using Spark and register it as a Hive table.
val df = spark.read.csv("s3://...") // read the CSV dump from S3
df.write.saveAsTable("mytable") // saveAsTable is on the DataFrameWriter; saves as a Hive table
Is it possible to use Amazon Redshift as the data source for an Excel pivot table? Googling this question didn't yield any obvious answers. Thanks.
Yes, I have done this.
However, things have changed since the other answers were written: rather than using generic PostgreSQL drivers, you should use the custom Redshift drivers provided by Amazon.
The answers you are looking for are here:
http://docs.aws.amazon.com/redshift/latest/mgmt/configure-odbc-connection.html
You can consume Amazon Redshift databases with the PostgreSQL ODBC drivers.
Download and install the driver.
Set up a DSN on the machine, pointing to your Redshift server with your AWS credentials (you can find the ODBC connection string in the settings area of your cluster).
Use that connection in Excel or any other product that can connect to ODBC connections.
You can convert Excel to CSV and upload it to S3. Once the files are uploaded to S3, you can run the COPY command to copy the data from S3 to the Redshift cluster. You can run the COPY command via the PostgreSQL JDBC connector or with available tools like SQL Workbench/J.