I'm connecting to a PostgreSQL database in AWS Redshift using SQLAlchemy to do some data processing, and I need to extract the DDL for each table in a particular schema. I can't run commands like pg_dump --schema-only. What is the simplest way of extracting the DDL?
You can reflect all tables with SQLAlchemy's reflection system and print the CreateTable construct for each one:
from sqlalchemy import create_engine
from sqlalchemy.schema import MetaData
from sqlalchemy.schema import CreateTable

# engine = create_engine("postgresql+psycopg2://user:password@host:5439/dbname")  # your Redshift connection
meta = MetaData()
meta.reflect(bind=engine, schema="my_schema")  # limit reflection to the schema you need
for table in meta.sorted_tables:
    print(CreateTable(table).compile(engine))
I want to connect Apache Superset with Apache Spark (I have Spark 3.1.2) and query the data in Superset's SQL Lab using Apache Spark SQL.
On Spark's master node, I started the Thrift server using this command: spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Then I added the Spark cluster as a database in Superset using the SQLAlchemy URI hive://hive@spark:10000/. I am able to access the Spark cluster from Superset.
I can load JSON data as a table using this SQL:
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json"
and I am able to query data using simple SQL statements like SELECT * FROM test_table LIMIT 10.
But the problem is that the JSON data is compressed into gzipped files.
So I tried
CREATE table IF NOT EXISTS test_table
USING JSON
LOCATION "/path/to/data.json.gz"
but it did not work. I want to know how to load gzipped JSON data into a table.
Compressed JSON storage
If you have large JSON text, you can explicitly compress it using the built-in COMPRESS function (this example is SQL Server / T-SQL). In the following example the compressed JSON content is stored as binary data, and a computed column decompresses the JSON back to the original text using the DECOMPRESS function:
CREATE TABLE Person
( _id int identity constraint PK_JSON_ID primary key,
data varbinary(max),
value AS CAST(DECOMPRESS(data) AS nvarchar(max))
)
INSERT INTO Person(data)
VALUES (COMPRESS(@json))
COMPRESS and DECOMPRESS functions use standard GZip compression.
Another example, this time using the PostgreSQL json_fdw foreign data wrapper, which can read gzipped JSON files directly:
postgres=# CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
Note: This example was taken from https://www.citusdata.com/blog/2013/05/30/run-sql-on-json-files-without-any-data-loads
I connect to a HANA database from Python and read a given table from a schema into a pandas DataFrame using the following code:
import pandas as pd
from hdbcli import dbapi

conn = dbapi.connect(
    address=XXXX,
    port=32015,
    user="username",
    password="password",
)
schema = <schema_name>
tablename = <table name>
pd.read_sql(f'select * from {schema}.{tablename}', conn)
This code works without any issue - I am able to download the table into a pandas DataFrame.
However, I am unable to upload any pandas DataFrame back to the HANA DB, even into the same schema:
xy.to_sql('new_table',conn)
I even tried to pre-define the target table in HANA Studio, defining its columns and data types. Nonetheless, I get the following error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': (259, 'invalid table name: Could not find table/view SQLITE_MASTER in schema <RANDOM_SCHEMA>: line 1 col 18 (at pos 17)')
It is important to note that the <RANDOM_SCHEMA> in the above error is not the schema that was defined above, but rather my HANA Studio username.
I thought that since I can read the table into Data Frame, I should be able to write the data frame into a HANA DB table. Am I wrong? What am I missing?
The code tries to read from the SQLite catalog table sqlite_master, and that table doesn't exist on HANA (or on any other DBMS that is not SQLite). The reason is that pandas' to_sql only supports SQLAlchemy connectables and raw sqlite3 connections; when it is handed any other plain DBAPI connection (such as the hdbcli one above), it falls back to its SQLite code path.
However, for HANA there is a "machine learning" Python library (hana_ml) that provides easy integration of pandas DataFrames with the HANA database.
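A minimal sketch of that approach, assuming the hana_ml package and its create_dataframe_from_pandas helper (the connection details and table name below are illustrative, not from the original post):

from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas

# Connection details are placeholders; reuse the same host/port/credentials as above
cc = ConnectionContext(address="<host>", port=32015, user="<user>", password="<password>")

# Upload the pandas DataFrame `xy` as a new table in the target schema;
# force=True recreates the table if it already exists
hana_df = create_dataframe_from_pandas(cc, xy, table_name="NEW_TABLE",
                                       schema="<schema_name>", force=True)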
I am trying to use Azure Databricks in order to:
1- Insert rows into a table of an Azure SQL Database with Python 3. I cannot find documentation about inserting rows. (I have used this link to connect to the database Doc and it is working.)
2- Save a CSV file in my data lake
3- Create a table from a DataFrame, if possible
Thanks for your help, and sorry for my novice questions.
**1- Insert rows into a table of an Azure SQL Database with Python 3**
Azure Databricks comes with the SQL Server JDBC driver installed. We can use the JDBC driver to write a DataFrame to SQL Server. For more details, please refer to here.
For example
jdbcHostname = "<hostname>"
jdbcDatabase = "<database>"
jdbcPort = 1433
jdbcUsername = "<username>"  # placeholder
jdbcPassword = "<password>"  # placeholder
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# write
df = spark.createDataFrame([(1, "test1"), (2, "test2")], ["id", "name"])
df.write.jdbc(url=jdbcUrl, table="users", mode="overwrite", properties=connectionProperties)

# check
df1 = spark.read.jdbc(url=jdbcUrl, table="users", properties=connectionProperties)
display(df1)
2- Create a table from a DataFrame
If you want to create a Databricks table from a DataFrame, you can use the method registerTempTable or saveAsTable.
registerTempTable creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.
saveAsTable creates a permanent, physical table stored in S3 using the Parquet format. This table is accessible to all clusters including the dashboard cluster. The table metadata including the location of the file(s) is stored within the Hive metastore.
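As a rough illustration, reusing the df DataFrame from the snippet above (note that registerTempTable has been superseded by createOrReplaceTempView in newer Spark versions):

# Temporary view: only visible within this Spark session / cluster
df.createOrReplaceTempView("users_temp")
spark.sql("SELECT * FROM users_temp LIMIT 10").show()

# Permanent table: persisted through the metastore and visible to other clusters
df.write.mode("overwrite").saveAsTable("users_permanent")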
For more details, please refer to here and here.
Can we write data directly into a Snowflake table, without using a Snowflake internal stage, using Python?
It seems like an auxiliary task to write to a stage first, transform the data, and then load it into a table. Can it be done in one step, just like a JDBC connection to an RDBMS?
The absolute fastest way to load data into Snowflake is from a file on either an internal or external stage. Period. All connectors can insert data with standard INSERT commands, but this will not perform as well. That said, many of the Snowflake drivers now transparently use PUT/COPY commands to load large data into Snowflake via an internal stage. If this is what you are after, then you can leverage the write_pandas function to load data from a pandas DataFrame to Snowflake in a single command. Behind the scenes, it executes the PUT and COPY INTO for you.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-write-pandas
I highly recommend this pattern over INSERT commands in any driver. And I would also recommend transforms be done AFTER loading to Snowflake, not before.
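A minimal sketch of that approach with the Snowflake Python connector (account, credentials, and table name below are placeholders):

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)

df = pd.DataFrame({"ID": [1, 2], "NAME": ["alice", "bob"]})

# write_pandas stages the data (PUT) and runs COPY INTO behind the scenes;
# the target table must already exist (newer connector versions also support auto_create_table=True)
success, nchunks, nrows, _ = write_pandas(conn, df, table_name="MY_TABLE")
print(success, nrows)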
If someone is having issues with large datasets, try using Dask instead and generate your DataFrame partitioned into chunks. Then you can use dask.delayed with SQLAlchemy. Here we use Snowflake's native connector method, i.e. pd_writer, which under the hood uses write_pandas and eventually PUT/COPY with compressed Parquet files. In the end it comes down to your I/O bandwidth: the more throughput you have, the faster the data gets loaded into the Snowflake table. But this snippet provides a decent amount of parallelism overall.
import functools

import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from snowflake.connector.pandas_tools import pd_writer

# csv_file_path, engine, table_name, schema_name and if_exists come from your own setup
df = dd.read_csv(csv_file_path, blocksize='64MB')
ddf_delayed = df.to_sql(
    table_name.lower(),
    uri=str(engine.url),
    schema=schema_name,
    if_exists=if_exists,
    index=False,
    method=functools.partial(pd_writer, quote_identifiers=False),
    compute=False,
    parallel=True
)
with ProgressBar():
    dask.compute(ddf_delayed, scheduler='threads', retries=3)
Java:
Load Driver Class:
Class.forName("net.snowflake.client.jdbc.SnowflakeDriver")
Maven:
Add the following block as a dependency:
<dependency>
    <groupId>net.snowflake</groupId>
    <artifactId>snowflake-jdbc</artifactId>
    <version>{version}</version>
</dependency>
Spring:
application.yml:
spring:
  datasource:
    hikari:
      maximumPoolSize: 4 # Specify maximum pool size
      minimumIdle: 1 # Specify minimum pool size
      driver-class-name: net.snowflake.client.jdbc.SnowflakeDriver
Python :
import pyodbc
# pyodbc connection string
conn = pyodbc.connect("Driver={SnowflakeDSIIDriver}; Server=XXX.us-east-2.snowflakecomputing.com; Database=VAQUARKHAN_DB; schema=public; UID=username; PWD=password")
# Cursor
cus = conn.cursor()
# Execute a SQL statement to get the current date and store the result in the cursor
cus.execute("select current_date;")
# Display the content of the cursor
row = cus.fetchone()
print(row)
See also: How to insert json response data in snowflake database more efficiently?
Apache Spark:
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>spark-snowflake_2.11</artifactId>
<version>2.5.9-spark_2.4</version>
</dependency>
Code
import org.apache.spark.sql.DataFrame

// Use the secrets DBUtil to get Snowflake credentials.
val user = dbutils.secrets.get("data-warehouse", "<snowflake-user>")
val password = dbutils.secrets.get("data-warehouse", "<snowflake-password>")
val options = Map(
"sfUrl" -> "<snowflake-url>",
"sfUser" -> user,
"sfPassword" -> password,
"sfDatabase" -> "<snowflake-database>",
"sfSchema" -> "<snowflake-schema>",
"sfWarehouse" -> "<snowflake-cluster>"
)
// Generate a simple dataset containing five values and write the dataset to Snowflake.
spark.range(5).write
.format("snowflake")
.options(options)
.option("dbtable", "<snowflake-database>")
.save()
// Read the data written by the previous cell back.
val df: DataFrame = spark.read
.format("snowflake")
.options(options)
.option("dbtable", "<snowflake-database>")
.load()
display(df)
The fastest way to load data into Snowflake is from a file; see the links below (a short PUT/COPY sketch follows them).
https://community.snowflake.com/s/article/How-to-Load-Terabytes-Into-Snowflake-Speeds-Feeds-and-Techniques
https://bryteflow.com/how-to-load-terabytes-of-data-to-snowflake-fast/
https://www.snowflake.com/blog/ability-to-connect-to-snowflake-with-jdbc/
https://docs.snowflake.com/en/user-guide/jdbc-using.html
https://www.persistent.com/blogs/json-processing-in-spark-snowflake-a-comparison/
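For illustration, a hedged sketch of file-based loading with the Snowflake Python connector (the file path, table name, and connection parameters are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()
# Upload the local file to the table's internal stage, then bulk load it
cur.execute("PUT file:///tmp/data.csv @%MY_TABLE AUTO_COMPRESS=TRUE")
cur.execute("COPY INTO MY_TABLE FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
conn.close()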
I have a small log DataFrame which holds metadata about the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly. However, as my Data Factory is invoking a stored procedure,
I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using SELECT statements with the query parameter:
Target Database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with DataFrames; you can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MS SQL ODBC driver (How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
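For example, a rough pyodbc sketch (the driver name, server, credentials, and the max_ingestion_date variable are placeholders/assumptions, and the ODBC driver must already be installed on the cluster):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<database>;"
    "UID=<user>;PWD=<password>"
)

query = """
UPDATE [internal].[Job]
SET [MaxIngestionDate] = ?,
    [DataLakeMetadataRaw] = NULL,
    [DataLakeMetadataCurated] = NULL
WHERE [IsRunning] = 1
  AND [FinishDateTime] IS NULL
"""

cursor = conn.cursor()
cursor.execute(query, max_ingestion_date)  # max_ingestion_date is supplied by the notebook
conn.commit()
conn.close()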