How to insert multiple rows of a pandas dataframe into Azure Synapse SQL DW using pyodbc? - azure

I am using pyodbc to connect to Azure Synapse SQL DW, and the connection is established successfully. However, inserting a pandas dataframe fails when I try to insert multiple rows in a single VALUES clause; inserting the rows one at a time works. Multi-row inserts like this worked fine with AWS Redshift and MS SQL, but they fail with Azure Synapse SQL DW. I suspect the T-SQL dialect in Azure Synapse SQL differs from the one in MS SQL Server, but I have not been able to find any relevant documentation.
I have a pandas df named 'df' that looks like this:
student_id admission_date
1 2019-12-12
2 2018-12-08
3 2018-06-30
4 2017-05-30
5 2020-03-11
This code below works fine
import pandas as pd
import pyodbc
#conn object below is the pyodbc 'connect' object
batch_size = 1
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
As you can see, it inserts just 1 row of 'df'. I could loop through and insert the rows one by one, but that takes a very long time for larger dataframes.
This code below doesn't work when I try to insert all rows together
import pandas as pd
import pyodbc
batch_size = 5
i = 0
chunk = df[i:i+batch_size]
conn.autocommit = True
sql = 'insert INTO {} values {}'.format('myTable', ','.join(
str(e) for e in zip(chunk.student_id.values, chunk.admission_date.values.astype(str))))
print(sql)
cursor = conn.cursor()
cursor.execute(sql)
The error I get is this:
ProgrammingError: ('42000', "[42000]
[Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Parse error at
line: 1, column: 74: Incorrect syntax near ','. (103010)
(SQLExecDirectW)")
This is the sample SQL query for 2 rows which fails:
insert INTO myTable values (1, '2009-12-12'),(2, '2018-12-12')

That's because Azure Synapse SQL does not support multi-row inserts via the VALUES constructor.
One workaround is to chain "select (value list) union all". Your pseudo SQL should look like this:
insert INTO {table}
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)} union all
...
select {chunk.student_id.values}, {chunk.admission_date.values.astype(str)}
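The pseudo SQL above can be sketched in Python roughly as follows. This is a minimal sketch, not a tested Synapse recipe: the helper name build_union_all_insert is hypothetical, and using `?` parameter markers inside the SELECT list is an assumption you should verify against your own pool and driver.

```python
from typing import Any, List, Sequence, Tuple


def build_union_all_insert(
    table: str, rows: Sequence[Sequence[Any]]
) -> Tuple[str, List[Any]]:
    """Build 'INSERT INTO t SELECT ?, ? UNION ALL SELECT ?, ?' plus flat params.

    Hypothetical helper: the table name is interpolated as-is, so it must
    come from trusted code, never from user input.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of values")

    # One 'SELECT ?, ?' per row, chained with UNION ALL.
    select_one = "SELECT " + ", ".join(["?"] * width)
    sql = f"INSERT INTO {table}\n" + "\nUNION ALL\n".join([select_one] * len(rows))

    # Flatten the rows so the values line up with the markers in order.
    params = [value for row in rows for value in row]
    return sql, params


# Usage with the dataframe chunk from the question (assumes 'chunk' and 'conn'):
#   rows = list(zip(chunk.student_id.tolist(),
#                   chunk.admission_date.astype(str).tolist()))
#   sql, params = build_union_all_insert("myTable", rows)
#   conn.cursor().execute(sql, params)
```

Passing the values as parameters instead of formatting them into the string avoids quoting problems with dates and strings.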

The COPY statement in Azure Synapse Analytics is a better way to load your data into a Synapse SQL pool.
COPY INTO test_parquet
FROM 'https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet'
WITH (
FILE_FORMAT = myFileFormat,
CREDENTIAL=(IDENTITY= 'Shared Access Signature', SECRET='<Your_SAS_Token>')
)
You can save your pandas dataframe to blob storage and then trigger the COPY command using the cursor's execute method.
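Triggering the COPY command from Python could look roughly like this. It is a sketch under assumptions: build_copy_statement and run_copy are hypothetical helper names, the statement uses FILE_TYPE = 'PARQUET' rather than the named FILE_FORMAT object from the example above, and uploading the dataframe to blob storage is left out.

```python
def _quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_statement(
    table: str, source_url: str, sas_token: str, file_type: str = "PARQUET"
) -> str:
    """Return a COPY INTO statement that authenticates with a SAS token.

    Hypothetical helper: the table name is interpolated as-is and must be trusted.
    """
    return (
        f"COPY INTO {table}\n"
        f"FROM {_quote(source_url)}\n"
        "WITH (\n"
        f"    FILE_TYPE = {_quote(file_type)},\n"
        "    CREDENTIAL = (IDENTITY = 'Shared Access Signature', "
        f"SECRET = {_quote(sas_token)})\n"
        ")"
    )


def run_copy(cursor, statement: str) -> None:
    """Execute the statement on any DB-API style cursor, e.g. a pyodbc cursor."""
    cursor.execute(statement)


# Usage (assumes 'conn' is an open pyodbc connection with autocommit enabled):
#   stmt = build_copy_statement(
#       "test_parquet",
#       "https://myaccount.blob.core.windows.net/myblobcontainer/folder1/*.parquet",
#       "<Your_SAS_Token>",
#   )
#   run_copy(conn.cursor(), stmt)
```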

Related

Dask read_sql_query did not execute sql that I put in

Hi all I'm new to Dask.
I faced an error when I tried using read_sql_query to get data from Oracle database.
Here is my python script:
con_str = "oracle+cx_oracle://{UserID}:{Password}@{Domain}/?service_name={Servicename}"
sql = """
column_a, column_b
from
database.tablename
where
mydatetime >= to_date('1997-01-01 00:00:00','YYYY-MM-DD HH24:MI:SS')
"""
from sqlalchemy.sql import select, text
from dask.dataframe import read_sql_query
sa_query= select(text(sql))
ddf = read_sql_query(sql=sa_query, con=con_str, index_col="index", head_rows=5)
I referred to this post: Reading an SQL query into a Dask DataFrame, and removed the "select" string from my query.
I then got a cx_Oracle.DatabaseError reporting a missing expression [SQL: SELECT FROM DUAL WHERE ROWNUM <= 5]
But I don't understand where that query came from.
It seems that it didn't execute the SQL code I provided.
I'm not sure which part I configured incorrectly.
*Note: pandas.read_sql works fine; it only fails with dask.dataframe.read_sql_query

Synapse Dedicated SQL Pool - Copy Into Failing With Odd error - Python

I'm getting an error when attempting to insert from a temp table into a table that exists in Synapse, here is the relevant code:
def load_adls_data(self, schema: str, table: str, environment: str, filepath: str, columns: list) -> str:
    if self.exists_schema(schema):
        if self.exists_table(schema, table):
            if environment.lower() == 'prod':
                schema = "lvl0"
            else:
                schema = f"{environment.lower()}_lvl0"
            temp_table = self.generate_temp_create_table(schema, table, columns)
            sql0 = """
            IF OBJECT_ID('tempdb..#CopyDataFromADLS') IS NOT NULL
            BEGIN
            DROP TABLE #CopyDataFromADLS;
            END
            """
            sql1 = """
            {}
            COPY INTO #CopyDataFromADLS FROM
            '{}'
            WITH
            (
            FILE_TYPE = 'CSV',
            FIRSTROW = 1
            )
            INSERT INTO {}.{}
            SELECT *, GETDATE(), '{}' from #CopyDataFromADLS
            """.format(temp_table, filepath, schema, table, Path(filepath).name)
            print(sql1)
            conn = pyodbc.connect(self._synapse_cnx_str)
            conn.autocommit = True
            with conn.cursor() as db:
                db.execute(sql0)
                db.execute(sql1)
If I get rid of the insert statement and just do a select from the temp table in the script:
SELECT * FROM #CopyDataFromADLS
I get the same error in either case:
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Not able to validate external location because The remote server returned an error: (409) Conflict. (105215) (SQLExecDirectW)')
I've run the generated code for both the insert and the select in Synapse and they ran perfectly. Google has no real info on this so could someone assist with this? Thanks
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Not able to validate external location because The remote server returned an error: (409) Conflict. (105215) (SQLExecDirectW)')
This error usually points to an authentication or access problem.
Make sure you have blob storage contributor access (for example, the Storage Blob Data Contributor role on the storage account).
In the COPY INTO script, add the authentication key for the blob storage, unless it is public blob storage.
I tried to reproduce this using a COPY INTO statement without authentication and got the same error.
After adding authentication with a SAS key, the data was copied successfully.
Refer to the Microsoft documentation for the permissions required for bulk loading with COPY INTO statements.

How do you setup a Synapse Serverless SQL External Table over partitioned data?

I have setup a Synapse workspace and imported the Covid19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a Serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function like this. Then filter on it. That achieves partition elimination. You can confirm by the bytes read compared to when you don’t filter on that column.
CREATE VIEW MyView
As
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
Select * from MyView where country_region='Afghanistan'
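To make concrete what filepath(1) returns in the view above, here is a small Python emulation. It is only an illustration, not how Synapse implements the function: filepath_value is a hypothetical name, the file name part-0000.parquet is made up, and the sketch assumes each `*` matches within a single path segment.

```python
import re
from typing import Optional


def filepath_value(pattern: str, path: str, n: int = 1) -> Optional[str]:
    """Return the text matched by the n-th '*' in pattern, or None if no match.

    Emulation only: each '*' is assumed to match within one path segment.
    """
    # Escape the literal pieces and put a capture group where each '*' was.
    literal_parts = pattern.split("*")
    regex = "([^/]*)".join(re.escape(part) for part in literal_parts)
    match = re.fullmatch(regex, path)
    return match.group(n) if match else None


pattern = "https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*"
path = (
    "https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/"
    "country_region=Afghanistan/part-0000.parquet"
)

# The first wildcard captures the partition value, which the view exposes
# as the country_region column and the WHERE clause filters on.
country_region = filepath_value(pattern, path, 1)
file_name = filepath_value(pattern, path, 2)
```

Filtering on that derived column is what lets the engine skip folders whose name does not match.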

Pandas .to_sql fails silently randomly

I have several large pandas dataframes (about 30k+ rows) and need to upload a different version of them daily to a MS SQL Server db. I am trying to do so with the to_sql pandas function. On occasion, it will work. Other times, it will fail - silently - as if the code uploaded all of the data despite not having uploaded a single row.
Here is my code:
class SQLServerHandler(DataBaseHandler):
    ...
    def _getSQLAlchemyEngine(self):
        '''
        Get an sqlalchemy engine
        from the connection string
        The fast_executemany fails silently:
        https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))
        return engine

    @logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
        Upload a pandas dataFrame
        to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()
        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)
Switching the method to None seems to work properly, but the data takes an extremely long time to upload (30+ mins). With multiple tables (20 or so) of this size to upload each day, that rules out this solution.
The proposed solution here, adding the schema as a parameter, doesn't work. Neither does creating a sqlalchemy session and passing it to the con parameter with session.get_bind().
I am using:
ODBC Driver 17 for SQL Server
pandas 1.2.1
sqlalchemy 1.3.22
pyodbc 4.0.30
Does anyone know how to make it raise an exception if it fails?
Or why it is not uploading any data?
In rebuttal to this answer, if to_sql() was to fall victim to the issue described in
SQL Server does not finish execution of a large batch of SQL statements
then it would have to be constructing large anonymous code blocks of the form
-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…
and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()
crsr.execute(f"TRUNCATE TABLE {table_name}")
sql = "".join(
[
f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
for i in range(num_rows)
]
)
crsr.execute(sql)
row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}") # 316
Using to_sql() for that same operation works
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
df = pd.DataFrame(
[(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)
engine = sa.create_engine(
"mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)
df.to_sql(
table_name,
engine,
index=False,
if_exists="replace",
)
with engine.connect() as conn:
row_count = conn.execute(
sa.text(f"SELECT COUNT(*) FROM {table_name}")
).scalar()
print(f"row_count: {row_count}") # 400
and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
Ok, this seems to be an issue with SQL Server itself.
SQL Server does not finish execution of a large batch of SQL statements

How to insert a row into a table in MS SQL using Python pandas

When trying to insert a row into a table in MS SQL using Python pandas, I get the error "'NoneType' object is not iterable" when executing the INSERT query in Python. I use Python 3.6 and Microsoft SQL Server Management Studio 2008.
my code:
import pyodbc
import pandas as pd
server = 'ACER'
db = 'fin'
# Create the connection
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + db + ';Trusted_Connection=yes')
# query db
sql = """INSERT INTO [fin].[dbo].[items] (itemdate, itemtype, name, amount) VALUES('2017-04-01','income','bonus',350) """
#df = pd.read_sql(sql, conn)
df = pd.read_sql(sql, conn)
print(df.to_string())
Somebody suggested using SET NOCOUNT ON, so I tried to modify the query to:
sql = """ SET NOCOUNT ON
---
INSERT INTO [fin].[dbo].[items] (itemdate, itemtype, name, amount) VALUES('2017-04-01','income','bonus',350) """.split("---")
but the execution failed.
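One likely explanation, offered as an assumption rather than a confirmed diagnosis: pd.read_sql expects a statement that returns rows, and an INSERT returns none. The sketch below uses the standard-library sqlite3 module as a stand-in for pyodbc (both follow the Python DB-API and use `?` markers); the in-memory items table is hypothetical. It shows the INSERT being run through cursor.execute and committed, with reads done in a separate SELECT.

```python
import sqlite3

# Stand-in database; with pyodbc you would use your existing 'conn' instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE items (itemdate TEXT, itemtype TEXT, name TEXT, amount INTEGER)"
)

# Run the INSERT directly on a cursor, passing values as parameters.
cur.execute(
    "INSERT INTO items (itemdate, itemtype, name, amount) VALUES (?, ?, ?, ?)",
    ("2017-04-01", "income", "bonus", 350),
)
conn.commit()

# An INSERT produces no result set, so the cursor has no column description.
insert_has_result_set = cur.description is not None

# Reading data back is a separate SELECT, which does return rows.
cur.execute("SELECT itemdate, itemtype, name, amount FROM items")
rows = cur.fetchall()
print(insert_has_result_set, rows)
```

With pandas, the same split applies: execute the INSERT on the connection, then use pd.read_sql only for the SELECT.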
