copy data from one spanner db to an existing spanner db - google-cloud-spanner

I need to find a tool or a technique that will generate insert statements from a Spanner database so I can insert them into another Spanner database. I need to selectively choose which insert statements or rows to migrate, so the Spanner export/import tool will not work. The destination database will already exist and will have existing data in it. The amount of data is small: roughly 15 tables with 10 to 20 rows in each. Any suggestions would be greatly appreciated.

You can use the Cloud Spanner Dataflow Connector to write your pipeline/data loader to move data in and out of Spanner. You can use a custom SQL query with the Dataflow reader to read the subset of data that you want to export.
Depending on how wide your tables are, if you are dealing with a relatively small amount of data, a simpler way to do this could be to use the gcloud spanner databases execute-sql command-line utility. For each of your tables, you could run a SQL query that selects the rows you want to export and write the result to a file in CSV format using the --format=csv argument. Then you could write a small wrapper around the Cloud Spanner Insert APIs to read the data from the CSV files and send insert mutations to the target database, as sketched below.
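For illustration, here is a minimal sketch of that approach using the google-cloud-spanner Python client. The instance, database, table, and column names (and the export query) are placeholders; the gcloud command in the comment shows the export step for one table.

# Export step (run once per table), e.g.:
#   gcloud spanner databases execute-sql source-db --instance=source-instance \
#       --sql="SELECT UserId, Name FROM Users WHERE ..." --format=csv > users.csv
import csv
from google.cloud import spanner

client = spanner.Client()
database = client.instance("target-instance").database("target-db")

# Read the exported CSV; convert column types here if needed (e.g. INT64 keys).
with open("users.csv", newline="") as f:
    rows = [(row["UserId"], row["Name"]) for row in csv.DictReader(f)]

# Send the rows to the destination database as insert mutations.
# Use batch.insert_or_update instead if some keys may already exist there.
with database.batch() as batch:
    batch.insert(table="Users", columns=("UserId", "Name"), values=rows)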

Related

Writing Data to External Databases Through PySpark

I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc(),
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
Here, if I am not mistaken, the only options available for mode are append and overwrite; however, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to say, write SQL queries to write data to the external databases? If so, please give me an example.
First, I suggest you use the dedicated Azure SQL connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode, as row-by-row mode is slow and can incur unexpected charges if you have Log Analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code, or better yet a stored procedure, that performs the required DML (for example, merging the staged rows into a final table), as sketched below
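To make that concrete, here is a minimal sketch of the staging-plus-DML pattern, assuming an Azure SQL target. The table names, the jdbc_url/db_user/db_password/odbc_connection_string variables, and the MERGE statement are placeholders for whatever logic you need; for the staging load you could equally use the dedicated Azure SQL connector instead of the plain JDBC writer.

import pyodbc  # assumed available for running DML against the database

staging_table = "dbo.staging_orders"   # hypothetical staging table
final_table = "dbo.orders"             # hypothetical final table

# 1. Load the raw DataFrame into the (empty) staging table.
(spark_df.write
    .jdbc(url=jdbc_url, table=staging_table, mode="overwrite",
          properties={"user": db_user, "password": db_password,
                      "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}))

# 2. Run the update/delete/merge logic inside the database. Spark's JDBC
#    writer cannot issue arbitrary DML, so use a regular DB-API connection
#    (or call a stored procedure that contains the logic).
conn = pyodbc.connect(odbc_connection_string)
cursor = conn.cursor()
cursor.execute(f"""
    MERGE {final_table} AS t
    USING {staging_table} AS s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
""")
conn.commit()
conn.close()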

Fastest way to write data into the database

My use case is as follows:
I have a Redshift cluster that I upload data into (basically, I use pandas to just replace the data every day). The upload happens every hour and the number of records is close to 35k (and it keeps increasing every day).
Now, I want to know the quickest way to write the data into the cluster.
Do I manually delete the existing data with a DELETE query and then write the data to Redshift using dataframe.to_sql?
Do I just let the dataframe.to_sql function do the job automatically by adding the if_exists="replace" option?
Which is the quickest way to deal with a huge number of records?
Apparently sqlalchemy-redshift uses psycopg2, so if you search for similar questions about PostgreSQL you should find some helpful examples. For instance, at the very least the method="multi" option of pandas' to_sql method might help speed up the upload.
As for deleting the data vs. dropping and re-creating the table via if_exists="replace", the former will likely be faster, especially if you can TRUNCATE the table instead of just deleting all the rows.
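A minimal sketch of that suggestion, assuming a sqlalchemy-redshift/psycopg2 engine; the connection string, the df DataFrame, and the table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("redshift+psycopg2://user:password@redshift-host:5439/mydb")

# Empty the existing table; TRUNCATE is faster than DELETE and reclaims space.
with engine.begin() as conn:
    conn.execute(text("TRUNCATE TABLE my_table"))

# Append the fresh rows; method="multi" packs many rows into each INSERT.
df.to_sql("my_table", engine, if_exists="append", index=False,
          method="multi", chunksize=1000)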

Loading data from AWS redshift using python

I'm facing a seemingly impossible mission: extracting a huge amount of data from Amazon Redshift into another table. This definitely requires a more efficient approach, but I'm new to SQL and AWS, so I decided to ask this smart community for advice.
This is my initial SQL query which takes forever:
-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
What would be the best approach? I was thinking of using Python and SQLAlchemy to create dataframes in chunks of 1m rows and insert them back into the new table (which I need to create beforehand). Would this work?
from sqlalchemy import create_engine
import os
import pandas as pd

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

# read the subset in 1m-row chunks and append each chunk to the new table
for df in pd.read_sql_query("""
        SELECT DISTINCT at_id, utc_time, name
        FROM s3_db.table_x
        WHERE type = 'create'
        AND (dt BETWEEN '20181001' AND '20181031');
        """, engine, chunksize=1000000):
    df.to_sql('my_new_table', engine, if_exists='append', index=False)
You should use CREATE TABLE AS.
This allows you to specify a SELECT statement and have the results directly stored into a new table.
This is hugely more efficient than downloading data and re-uploading.
You can also CREATE TABLE LIKE and then load it with data. See: Performing a Deep Copy
You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.
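For example, the whole step could be a single statement run through the SQLAlchemy engine the question already builds (the filter values are just the ones from the question); nothing is downloaded to the client:

from sqlalchemy import text

ctas_sql = """
    CREATE TABLE my_new_table AS
    SELECT DISTINCT at_id, utc_time, name
    FROM s3_db.table_x
    WHERE type = 'create'
      AND dt BETWEEN '20181001' AND '20181031';
"""
# All of the work happens inside Redshift; the client only issues the statement.
with engine.begin() as conn:   # engine as defined in the question
    conn.execute(text(ctas_sql))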
Please refer to the AWS guidelines for Redshift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform, and load a huge amount of data from the Redshift Spectrum-based table "s3_db.table_x" into the new Redshift table "my_new_table".
Here are some suggestions based on AWS recommendations:
Create your Redshift table with an appropriate distribution key, sort key, and compression encoding. At a high level, "at_id" seems best suited as the distribution key and "utc_time" as the sort key for your requirement, but make sure to refer to the AWS guidelines for Redshift table design [3].
Since, as you mentioned, your data volume is huge, you may want to have your S3 source table "s3_db.table_x" partitioned on the "type" and "dt" columns (as suggested at point 4 in the Spectrum best practices [1]).
Replace DISTINCT with GROUP BY in the SELECT query against Spectrum (point 9 in the Spectrum best practices [1]).
AWS recommends (point 7 in the Spectrum best practices [1]) simplifying your ETL process using CREATE TABLE AS SELECT or SELECT INTO statements, wherein you may put your transformation logic in the SELECT component to load data directly from S3 into Redshift.
[1] Redshift Spectrum best practices
[2] Redshift best practices
[3] Redshift table design playbook
It now appears that your source data is stored in Amazon S3 and you have been using a Redshift Spectrum table (that points to data in S3) as your source.
The preferred method would be:
Use the Amazon Redshift COPY command to load the data into a Redshift table
Use a CREATE TABLE AS command to extract (ETL) the data from the new Redshift table into your desired table. If you do this on a regular basis, you can use TRUNCATE and INSERT INTO to reload the table in future.
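A minimal sketch of that two-step pattern, again run through a SQLAlchemy engine; the bucket path, IAM role, source file format, and raw-table name are all placeholders:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@redshift-host:5439/redshiftdb")

with engine.begin() as conn:
    # Step 1: COPY the raw S3 data into a plain Redshift table.
    conn.execute(text("""
        COPY raw_table_x
        FROM 's3://my-bucket/table_x/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        FORMAT AS PARQUET;
    """))
    # Step 2: transform into the target table. On later runs, TRUNCATE the
    # target and use INSERT INTO ... SELECT instead of CREATE TABLE AS.
    conn.execute(text("""
        CREATE TABLE my_new_table AS
        SELECT DISTINCT at_id, utc_time, name
        FROM raw_table_x
        WHERE type = 'create'
          AND dt BETWEEN '20181001' AND '20181031';
    """))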

What are smart data sources in spark?

I wanted to know what data sources can be called 'smart' in Spark. As per the book "Mastering Apache Spark 2.x", any data source can be called smart if Spark can process data at the data source side, for example JDBC sources.
I want to know whether MongoDB, Cassandra, and Parquet could be considered smart data sources as well.
I believe those can be smart data sources as well. Slides 41 to 42 of the Databricks presentation at https://www.slideshare.net/databricks/bdtc2 mention smart data sources and show logos that include those sources (the MongoDB logo isn't there, but I believe it supports the same thing; see the "Leverage the Power of MongoDB" section at https://www.mongodb.com/products/spark-connector).
I was also able to find some information supporting that MongoDB is a smart data source, since it's used as an example in the "Mastering Apache Spark 2.x" book:
"Predicate push-down on smart data sources Smart data sources are those that support data processing directly in their own engine-where the data resides--by preventing unnecessary data to be sent to Apache Spark.
On example is a relational SQL database with a smart data source. Consider a table with three columns: column1, column2, and column3, where the third column contains a timestamp. In addition, consider an ApacheSparkSQL query using this JDBC data source but only accessing a subset of columns and rows based using projection and selection. The following SQL query is an example of such a task:
select column2,column3 from tab where column3>1418812500
Running on a smart data source, data locality is made use of by letting the SQL database do the filtering of rows based on timestamp and removal of column1. Let's have a look at a practical example on how this is implemented in the Apache Spark MongoDB connector"
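As a small illustration of what push-down looks like from the Spark side, here is a sketch against a generic JDBC source (the connection details are placeholders); the physical plan reports the pruned columns and PushedFilters, showing that the filtering happens in the database rather than in Spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # hypothetical database
      .option("dbtable", "tab")
      .option("user", "user")
      .option("password", "password")
      .load())

# Only column2/column3 and the matching rows are fetched; column1 and the
# filtered-out rows never leave the database.
result = df.select("column2", "column3").filter(col("column3") > 1418812500)
result.explain()  # the scan node lists the pruned columns and PushedFilters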

Limited results when using Excel BigQuery connector

I've pulled a data extract from BigQuery using the Excel connector but my results have been limited to 230,000 records.
Is this a limitation of the connector or something I have not done properly?
BigQuery does have a maximum response size of 64MB (compressed). So, depending on the size of your rows, it's quite possible that 230,000 is the maximum size response BigQuery can return.
See more info on quotas here:
https://developers.google.com/bigquery/docs/quota-policy
What's the use case, and how many rows are you expecting to be returned? Generally, BigQuery is used for large aggregate analysis rather than queries that return tons of unaggregated rows. If you're after the raw dataset, you can also dump the entire table as CSV into Google Cloud Storage.
Also, you may want to try running the query in the UI at:
https://bigquery.cloud.google.com/
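If you do want the raw dataset, here is a minimal sketch of the dump-to-Cloud-Storage route using the google-cloud-bigquery Python client; the project, dataset, table, and bucket names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Export the whole table as (possibly sharded) CSV files to a GCS bucket.
extract_job = client.extract_table(
    "my-project.my_dataset.my_table",
    "gs://my-bucket/exports/my_table-*.csv",
)
extract_job.result()  # wait for the export job to finish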
