Upload Pandas dataframe to HANA database using HDBCLI / DBAPI - python-3.x

I connect to HANA database from Python and read any given table from a schema into a Pandas dataframe using the following code:
import pandas as pd
from hdbcli import dbapi

conn = dbapi.connect(
    address="XXXX",
    port=32015,
    user="username",
    password="password",
)
schema = "<schema_name>"
tablename = "<table name>"
df = pd.read_sql(f'select * from {schema}.{tablename}', conn)
This code works without any issue: I am able to download the table into a pandas DataFrame.
However, I am unable to upload any pandas DataFrame back to the HANA DB, even within the same schema:
xy.to_sql('new_table',conn)
I even tried pre-defining the target table in HANA Studio, including its columns and data types. Nonetheless, I get the following error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': (259, 'invalid table name: Could not find table/view SQLITE_MASTER in schema <RANDOM_SCHEMA>: line 1 col 18 (at pos 17)')
It is important to note that the <RANDOM_SCHEMA> in the above error is not the schema defined above, but my username for HANA Studio.
I thought that since I can read the table into Data Frame, I should be able to write the data frame into a HANA DB table. Am I wrong? What am I missing?

For some reason the code tries to read from the SQLite catalog table sqlite_master, and that table doesn't exist on HANA (or on any other DBMS that is not SQLite).
This is because pandas' to_sql only supports raw DBAPI connections for SQLite; for any other database it expects a SQLAlchemy connectable, so passing the hdbcli connection makes pandas fall back to its SQLite code path. Pointing to_sql at a SQLAlchemy engine instead should work, as sketched below.
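A minimal sketch of that route, assuming the sqlalchemy-hana dialect is installed (pip install sqlalchemy-hana); host and credentials are placeholders:

from sqlalchemy import create_engine

# engine backed by the sqlalchemy-hana dialect instead of a raw DBAPI connection
engine = create_engine("hana://username:password@XXXX:32015")
xy.to_sql("new_table", engine, schema=schema, if_exists="replace", index=False)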
However, for HANA there is also a "machine learning" Python library, hana-ml, that provides easy integration of dataframes with the HANA database.
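A sketch using hana-ml's pandas upload helper; the connection parameters are placeholders, and force=True drops any existing table of the same name:

from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas

cc = ConnectionContext(address="XXXX", port=32015, user="username", password="password")
hdf = create_dataframe_from_pandas(
    connection_context=cc,
    pandas_df=xy,
    table_name="NEW_TABLE",
    schema=schema,   # target schema; defaults to the connecting user's schema
    force=True,      # drop and recreate the table if it already exists
)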

Related

Databricks accessing DataFrame in SQL

I'm learning Databricks and got stuck on the simplest step.
I'd like to use my DataFrame from Databricks' SQL ecosystem.
Here are my steps:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
display(df)
Everything is fine, df is displayed. Then submitting:
df.createOrReplaceGlobalTempView("covid")
Finally:
%sql
show tables
No results are displayed. When trying:
display(spark.sql('SELECT * FROM covid LIMIT 10'))
Getting the error:
[TABLE_OR_VIEW_NOT_FOUND] The table or view `covid` cannot be found
When executing:
df.createGlobalTempView("covid")
Again, I'm getting a message that covid already exists.
How to access my df from sql ecosystem, please?
In a Databricks notebook, if you're looking to use SQL to query your dataframe loaded in Python,
you can do so in the following way (using your example data):
Set up the df in Python:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
Set up your global view:
df.createGlobalTempView("covid")
Then a simple query in SQL will be equivalent to the display() function:
%sql
SELECT * FROM global_temp.covid
If you want to avoid using the global_temp prefix, use df.createTempView instead; a re-runnable variant is sketched below.
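Note that createGlobalTempView and createTempView raise an error if the view already exists, which explains the "covid already exists" message; the createOrReplace variants are safe to re-run:

# session-scoped view: no global_temp prefix needed, safe to re-run
df.createOrReplaceTempView("covid")
# then in a %sql cell: SELECT * FROM covid LIMIT 10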

Partitioned table on synapse

I'm trying to create a new partitioned table on my SqlDW (synapse) based on a partitioned table on Spark (synapse) with
%%spark
val df1 = spark.sql("SELECT * FROM sparkTable")
df1.write.partitionBy("year").sqlanalytics("My_SQL_Pool.dbo.StudentFromSpak", Constants.INTERNAL )
Error : StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
StructuredStream-spark package version: 2.4.5-1.3.1
java.sql.SQLException:
com.microsoft.sqlserver.jdbc.SQLServerException: External file access
failed due to internal error: 'File
/synapse/workspaces/test-partition-workspace/sparkpools/myspark/sparkpoolinstances/c5e00068-022d-478f-b4b8-843900bd656b/livysessions/2021/03/09/1/tempdata/SQLAnalyticsConnectorStaging/application_1615298536360_0001/aDtD9ywSeuk_shiw47zntKz.tbl/year=2000/part-00004-5c3e4b1a-a580-4c7e-8381-00d92b0d32ea.c000.snappy.parquet:
HdfsBridge::CreateRecordReader - Unexpected error encountered
creating the record reader: HadoopExecutionException: Column count
mismatch. Source file has 5 columns, external table definition has 6
columns.' at
com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsJDBCWrapper.executeUpdateStatement(SQLAnalyticsJDBCWrapper.scala:89)
at
thanks
The sqlanalytics() function name has been changed to synapsesql(). It does not currently support writing partitioned tables, but you could implement this yourself, e.g. by writing multiple tables back to the dedicated SQL pool and then using partition switching there.
The syntax is simply (as per the documentation):
df.write.synapsesql("<DBName>.<Schema>.<TableName>", <TableType>)
An example would be:
df.write.synapsesql("yourDb.dbo.yourTablePartition1", Constants.INTERNAL)
df.write.synapsesql("yourDb.dbo.yourTablePartition2", Constants.INTERNAL)
Now do the partition switching in the database using the ALTER TABLE ... SWITCH PARTITION syntax, as sketched below.
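A hedged T-SQL sketch of the switch step; the table names and partition number are hypothetical, and the source and target tables must have matching definitions:

-- move the staged rows into partition 1 of the target table
ALTER TABLE dbo.yourTablePartition1 SWITCH TO dbo.yourTargetTable PARTITION 1;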

Databricks with python 3 for Azure SQL Database

I am trying to use Azure Databricks in order to:
1- Insert rows into a table of an Azure SQL Database with Python 3. I cannot see any documentation about inserting rows. (I have used this link to connect to the database Doc and it is working.)
2- Save a CSV file in my data lake
3- Create a table from a dataframe if possible
Thanks for your help and sorry for my novice questions
1- Insert rows into a table of an Azure SQL Database with Python 3
Azure Databricks has the JDBC driver installed. We can use the JDBC driver to write data to SQL Server with a dataframe. For more details, please refer to here.
For example
jdbcHostname = "<hostname>"
jdbcDatabase = "<database>"
jdbcPort = 1433
jdbcUsername = "<username>"   # placeholder; not defined in the original snippet
jdbcPassword = "<password>"   # placeholder; not defined in the original snippet
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# write
df = spark.createDataFrame([(1, "test1"), (2, "test2")], ["id", "name"])
df.write.jdbc(url=jdbcUrl, table="users", mode="overwrite", properties=connectionProperties)

# check
df1 = spark.read.jdbc(url=jdbcUrl, table="users", properties=connectionProperties)
display(df1)
2- Create Table from Dataframe
If you want to create a Databricks table from a dataframe, you can use the method registerTempTable or saveAsTable, as sketched below.
registerTempTable creates an in-memory table that is scoped to the cluster in which it was created. The data is stored using Hive's highly-optimized, in-memory columnar format.
saveAsTable creates a permanent, physical table stored in the workspace's cloud storage (S3 on AWS, Azure storage on Azure Databricks) using the Parquet format. This table is accessible to all clusters including the dashboard cluster. The table metadata including the location of the file(s) is stored within the Hive metastore.
For more details, please refer to here and here.
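A minimal sketch of both methods, reusing the df from the JDBC example above; note that registerTempTable is deprecated in newer Spark versions in favor of createOrReplaceTempView, and users_copy is a hypothetical table name:

# temporary table, scoped to the cluster/session
df.registerTempTable("users_temp")

# permanent managed table, registered in the Hive metastore
df.write.saveAsTable("users_copy")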

Get DDL from existing databases SQLAlchemy

I'm connecting to a PostgreSQL database in AWS Redshift using SQLAlchemy to do some data processing. I need to extract the DDL of each table in a particular schema. I can't run any commands like pg_dump --schema-only. What would be the simplest way of extracting the DDL?
You can get all tables with the reflection system and print the CreateTable construct for each table:
from sqlalchemy import create_engine
from sqlalchemy.schema import MetaData, CreateTable

# placeholder connection string; substitute your own credentials
engine = create_engine("postgresql://user:password@host:5439/database")

meta = MetaData()
meta.reflect(bind=engine)  # pass schema="<schema>" to reflect one particular schema
for table in meta.sorted_tables:
    print(CreateTable(table).compile(engine))

Update table from PySpark using JDBC

I have a small log dataframe which has metadata regarding the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly; however, as my Data Factory is invoking a stored procedure,
I need to work in a query like
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using select statements with the query parameter:
Target Database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with dataframes; you can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/). A sketch with pyodbc follows.
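A minimal sketch of the update via pyodbc, assuming the MSSQL ODBC driver is installed on the cluster; the server, database, and credentials are placeholders, and date comes from the notebook context as in the question's query:

import pyodbc

# placeholder connection details for the Azure SQL Database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;"
    "DATABASE=<database>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
# parameterized update mirroring the query in the question
cursor.execute(
    """UPDATE [internal].[Job]
       SET [MaxIngestionDate] = ?
       WHERE [IsRunning] = 1 AND [FinishDateTime] IS NULL""",
    date,
)
conn.commit()
conn.close()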
