Loading data from AWS Redshift using Python - python-3.x

I'm facing what feels like mission impossible: extracting a huge amount of data from Amazon Redshift into another table. It definitely requires a more efficient approach, but I'm new to SQL and AWS, so I decided to ask this smart community for advice.
This is my initial SQL query which takes forever:
-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
What would be the best approach? I was thinking of using Python and sqlalchemy to create dataframes in chunks of 1m rows and insert them back into the new table (which I need to create beforehand). Would this work?:
from sqlalchemy import create_engine
import os
import pandas as pd
redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']
engine_string = "postgresql+psycopg2://%s:%s#%s:%d/%s" \
% (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)
for df in pd.read_sql_query("""
SELECT DISTINCT at_id, utc_time, name
INSERT INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
""", engine, chunksize=1000000):

You should use CREATE TABLE AS.
This allows you to specify a SELECT statement and have the results directly stored into a new table.
This is hugely more efficient than downloading data and re-uploading.
You can also CREATE TABLE LIKE and then load it with data. See: Performing a Deep Copy
You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.
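For reference, a minimal sketch of running that CTAS from Python with the same sqlalchemy setup as the question (the port value below just stands in for the elided one):

import os
from sqlalchemy import create_engine, text

# Same connection details as in the question; 5439 is the usual Redshift port,
# substitute your own.
engine = create_engine('postgresql+psycopg2://%s:%s@%s:%d/%s' % (
    os.environ['REDSHIFT_USER'], os.environ['REDSHIFT_PASSWORD'],
    'localhost', 5439, 'redshiftdb'))

# The whole SELECT runs inside Redshift; nothing is pulled down to the client.
ctas = """
CREATE TABLE my_new_table AS
SELECT DISTINCT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND (dt BETWEEN '20181001' AND '20181031');
"""

with engine.begin() as conn:
    conn.execute(text(ctas))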

Please refer to the AWS guidelines for Redshift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform and load a huge amount of data from the Redshift Spectrum table "s3_db.table_x" into the new Redshift table "my_new_table".
Here are some suggestions based on AWS recommendations:
Create your Redshift table with an appropriate distribution key, sort key and compression encoding. At a high level, "at_id" seems best suited as the distribution key and "utc_time" as the sort key for your requirement, but make sure to refer to the AWS guidelines for Redshift table design (third link below).
Since, as you mentioned, your data volume is huge, you may want to have your S3 source table "s3_db.table_x" partitioned on the "type" and "dt" columns (as suggested in point 4 of the Spectrum best practices, first link below).
Replace DISTINCT with GROUP BY in the SELECT query against Spectrum (point 9 in the Spectrum best practices, first link below).
AWS recommends (point 7 in the Spectrum best practices, first link below) simplifying your ETL process using CREATE TABLE AS SELECT or SELECT INTO statements, where you can put your transformation logic in the SELECT component to load data directly from S3 into Redshift (a sketch combining these suggestions follows the links below).
Redshift Spectrum best practices
Redshift best practices
Redshift table design playbook
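Putting these suggestions together with the CTAS approach from the first answer, the statement might look roughly like this (the DISTKEY/SORTKEY choices simply follow the "at_id"/"utc_time" suggestion above and should be validated against your workload; it reuses the engine and text() import from the earlier sketch):

ctas_tuned = """
CREATE TABLE my_new_table
DISTKEY (at_id)
SORTKEY (utc_time)
AS
SELECT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031'
GROUP BY at_id, utc_time, name;
"""

with engine.begin() as conn:
    conn.execute(text(ctas_tuned))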

It now appears that your source data is stored in Amazon S3 and you have been using a Redshift Spectrum table (that points to data in S3) as your source.
The preferred method would be:
Use the Amazon Redshift COPY command to load the data into a Redshift table
Use a CREATE TABLE AS command to extract (ETL) the data from the new Redshift table into your desired table. If you do this on a regular basis, you can use TRUNCATE and INSERT INTO to reload the table in the future (see the sketch below).
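A rough sketch of that recurring reload, using psycopg2 directly; the staging table, S3 path and IAM role are made-up placeholders you'd replace with your own:

import os
import psycopg2

# Connection details follow the question (port 5439 is the usual Redshift port).
conn = psycopg2.connect(
    host='localhost', port=5439, dbname='redshiftdb',
    user=os.environ['REDSHIFT_USER'], password=os.environ['REDSHIFT_PASSWORD'])
conn.autocommit = True   # TRUNCATE commits implicitly in Redshift anyway

statements = [
    # 1) Land the raw S3 data in a plain Redshift staging table
    #    (assumed to already exist with the same columns as the source files).
    """COPY staging_table_x
       FROM 's3://my-bucket/path/to/data/'
       IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
       FORMAT AS PARQUET""",
    # 2) Reload the target table from the staging copy.
    "TRUNCATE my_new_table",
    """INSERT INTO my_new_table
       SELECT DISTINCT at_id, utc_time, name
       FROM staging_table_x
       WHERE type = 'create'
         AND dt BETWEEN '20181001' AND '20181031'""",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()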

Related

How to synchronize an external database on a Spark session

I have a Delta Lake on an S3 bucket.
Since I would like to use Spark's SQL API, I need to synchronize the Delta Lake with the local Spark session. Is there a quick way to have all the tables available, without having to create a temporary view for each one?
At the moment this is what I do (let's suppose I have 3 tables in the s3_bucket_path "folder").
s3_bucket_path = 's3a://bucket_name/delta_lake/'
spark.read.format('delta').load(s3_bucket_path + 'table_1').createOrReplaceTempView('table_1')
spark.read.format('delta').load(s3_bucket_path + 'table_2').createOrReplaceTempView('table_2')
spark.read.format('delta').load(s3_bucket_path + 'table_3').createOrReplaceTempView('table_3')
I was wondering if there was a quicker way to have all the tables available (without having to use boto3 and iterate through the folder to get the table names), or if I'm not following best practices for working with the Spark SQL APIs: should I use a different approach? I've been studying Spark for a week and I'm not 100% familiar with its architecture yet.
Thank you very much for your help.
Sounds like you'd like to use managed tables, so you have easy access to query the data with SQL, without manually registering views.
You can create a managed table as follows:
df.write.format("delta").saveAsTable("table_1")
The table path and schema information is stored in the Hive metastore (or another metastore if you've configured one). Managed tables save you from having to register the views manually.
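For example, a quick PySpark sketch of that idea, assuming a session already configured for Delta and using the bucket path and table names from the question; note that saveAsTable as used here writes a managed copy of the data under the warehouse location:

from pyspark.sql import SparkSession

# Assumes the session has the Delta Lake package/extensions configured.
spark = SparkSession.builder.getOrCreate()

s3_bucket_path = 's3a://bucket_name/delta_lake/'

# One-off registration: each Delta folder becomes a managed table in the metastore,
# so later sessions can query it with spark.sql without recreating temp views.
for name in ['table_1', 'table_2', 'table_3']:
    (spark.read.format('delta')
          .load(s3_bucket_path + name)
          .write.format('delta')
          .saveAsTable(name))

spark.sql('SELECT COUNT(*) FROM table_1').show()

If you'd rather have the tables keep pointing at the existing S3 files instead of copying them, registering them as external (unmanaged) tables is the alternative.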

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is: what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (the data storage location) that Delta Lake is running on top of... right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are parquet tables different to CSV tables in some way? Or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not Delta Lake specific; it's Spark SQL (or Hive) syntax for defining a table. It's just metadata that lets users use the table easily, without having to know where it's located, what data format it uses, etc. You can read more about databases & tables in the Databricks documentation.
The actual storage format is specified by the USING clause. In your case it's Parquet, so when people or code read or write data, the underlying engine first reads the table metadata, figures out the location of the data and the file format, and then uses the corresponding code.
Delta is another format (really a storage layer) built on top of Parquet as the data format, adding capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
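For illustration, both options can be run from the same session as the question (People10M_delta is just a made-up name):

# Option 1: convert the existing Parquet files at that path into a Delta table in place.
spark.sql("CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`")

# Option 2: create a brand-new Delta table from the existing Parquet table.
spark.sql("""
  CREATE TABLE People10M_delta
  USING delta
  AS SELECT * FROM People10M
""")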

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing fairly regularly, on a weekly basis. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones only pull updated records instead of pulling the entire table.
I can do this with Sqoop incremental import by specifying three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column which can be used as partitionColumn, so I thought of using a row_number function to add a unique column to these tables and then get the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement, like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
  option("driver", driver).
  option("url", url).
  option("partitionColumn", row_nums).
  option("lowerBound", min(row_nums)).
  option("upperBound", max(row_nums)).
  option("numPartitions", some value).
  option("fetchsize", some value).
  option("dbtable", queryNum).
  option("user", user).
  option("password", password).
  load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either by loading all of the previous data files directly or via a log file that you write out each time. If your data files are massive you may need the log file; if they're smaller, you could potentially load them.
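Here's a rough PySpark sketch of that bounds-then-read flow (your code is in Scala, but the JDBC options map one-to-one). The connection details below are placeholders, the query is the one from your question, and the watermark filter is left as a comment since it depends on what you persist between runs; it assumes an existing SparkSession named spark:

# Placeholders standing in for the question's connection details.
url = 'jdbc:postgresql://dbhost:5432/mydb'
driver = 'org.postgresql.Driver'
user, password = 'user', 'password'

query_num = """(select a1.*, row_number() over (order by sales) as row_nums
                from (select * from schema.table) a1)"""
# For the incremental part, add a filter on your check column inside the inner
# select, e.g. "where last_modified > '<last value you persisted>'".

def jdbc_reader():
    return (spark.read.format('jdbc')
            .option('driver', driver)
            .option('url', url)
            .option('user', user)
            .option('password', password))

# 1) Cheap single-partition read just to discover the bounds of the synthetic key.
bounds = (jdbc_reader()
          .option('dbtable', '(select min(row_nums) lo, max(row_nums) hi from '
                             + query_num + ' t) b')
          .load()
          .first())

# 2) Parallel read using the discovered bounds.
df = (jdbc_reader()
      .option('dbtable', query_num + ' q')
      .option('partitionColumn', 'row_nums')
      .option('lowerBound', str(bounds['lo']))
      .option('upperBound', str(bounds['hi']))
      .option('numPartitions', '8')
      .load())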

Write to a date-partitioned BigQuery table using the beam.io.gcp.bigquery.WriteToBigQuery module in Apache Beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED disposition gives an error because it would have to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, and creating them if they don't.
If I load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date.
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (not ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
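For illustration, a minimal sketch with a callable table destination; the bucket, project/dataset names, the 'log_type' field and the toy parser are all made up, and the destination tables are assumed to already exist as column-partitioned tables (matching the approach above):

import json
import apache_beam as beam

def parse_log(line):
    # Toy parser: assumes each log line is a JSON object with a 'log_type' field.
    return json.loads(line)

def route_to_table(row):
    # Pick the destination from a value in the record; note there is no '$YYYYMMDD'
    # partition decorator here, just a plain per-key table name.
    return 'my_project:my_dataset.logs_{}'.format(row['log_type'])

with beam.Pipeline() as p:
    (p
     | 'ReadLogs' >> beam.io.ReadFromText('gs://my_bucket/logs/*.json')
     | 'Parse' >> beam.Map(parse_log)
     | 'Write' >> beam.io.WriteToBigQuery(
           table=route_to_table,
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Newer Beam SDK releases also accept method='FILE_LOADS' on WriteToBigQuery to use batch load jobs instead of streaming inserts, which avoids the 31-day streaming window; check whether your SDK version supports it.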

Copy data from one Spanner db to an existing Spanner db

I need to find a tool or a technique that will generate insert statements from a spanner db so I can insert them into another spanner db. I need to selectively choose which insert statements or rows to migrate so the spanner export/import tool will not work. The destination db will already exist and it will have existing data in it. The amount of data is small - roughly 15 tables with 10 to 20 rows in each table. Any suggestions would be greatly appreciated.
You can use the Cloud Spanner Dataflow Connector to write your pipeline/data loader to move data in and out of Spanner. You can use a custom SQL query with the Dataflow reader to read the subset of data that you want to export.
Depending on how wide your tables are, if you are dealing with a relatively small amount of data, a simpler way to do this could be to use the gcloud spanner databases execute-sql command-line utility. For each of your tables, you could use the utility to run a SQL query to get the rows you want to export and write the result to a file in CSV format using the --format=csv argument. Then you could write a small wrapper around the Cloud Spanner Insert APIs to read the data from the CSV files and send insert mutations to the target database.
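For illustration, a small sketch of that wrapper using the google-cloud-spanner client library; the instance/database IDs, file name, table name and column list are placeholders:

import csv
from google.cloud import spanner

client = spanner.Client()
database = client.instance('target-instance').database('target-db')

columns = ('id', 'name', 'created_at')   # must match the order of the exported CSV columns

with open('my_table.csv', newline='') as f:
    # csv.reader yields strings; convert numeric/timestamp columns to the proper
    # Python types here before inserting.
    rows = [tuple(r) for r in csv.reader(f)]

# batch() buffers the mutations and commits them when the context manager exits.
with database.batch() as batch:
    batch.insert(table='my_table', columns=columns, values=rows)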
