I have a Spark DataFrame that I want to push to an SQL table on a remote server. The table has an Id column that is set as an identity column. The DataFrame I want to push also has an Id column, and I want to use those Ids in the SQL table without removing the identity option for the column.
I write the dataframe like this:
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But I get the following response:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 41, 10.1.0.4, executor 0): java.sql.BatchUpdateException: Cannot insert explicit value for identity column in table 'Table' when IDENTITY_INSERT is set to OFF.
I have tried adding a query to the write, like:
query = f"SET IDENTITY_INSERT Table ON;"
df.write.format("jdbc") \
.mode(mode) \
.option("url", jdbc_url) \
.option("query", query) \
.option("dbtable", table_name) \
.option("user", jdbc_username) \
.option("password", jdbc_password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.save()
But that just throws an error:
IllegalArgumentException: Both 'dbtable' and 'query' can not be specified at the same time.
Or if I try to run a read with the query first:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'SET'.
This must be because it only supports SELECT statements.
Is it possible to do this in Spark, or would I need to use a different connector and combine setting IDENTITY_INSERT ON with regular INSERT INTO statements?
I would prefer a solution that allowed me to keep writing through the Spark context. But I am open to other solutions.
One way to work around this issue is the following:
1. Save your dataframe as a temporary table in your database.
2. Set IDENTITY_INSERT to ON.
3. Insert into your real table the content of your temporary table.
4. Set IDENTITY_INSERT to OFF.
5. Drop your temporary table.
Here's a pseudo code example:
tablename = "MyTable"
tmp_tablename = tablename+"tmp"
df.write.format("jdbc").options(..., dtable=tmp_tablename).save()
columns = ','.join(df.columns)
query = f"""
SET IDENTITY_INSERT {tablename} ON;
INSERT INTO {tablename} ({columns})
SELECT {columns} FROM {tmp_tablename};
SET IDENTITY_INSERT {tablename} OFF;
DROP TABLE {tmp_tablename};
"""
execute(query)  # you can use a pyodbc Cursor, for example, to execute raw SQL queries (see the sketch below)
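For the final execute(query) step, a minimal sketch using pyodbc (the ODBC driver name, server and database below are placeholders rather than values from the question; reuse your own connection details):

import pyodbc

# Placeholder connection string; point it at the same server, database and credentials as the JDBC write.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.example.com;"
    "DATABASE=MyDatabase;"
    "UID=jdbc_username;PWD=jdbc_password",
    autocommit=True,  # let each statement take effect immediately, including the DROP TABLE
)
cursor = conn.cursor()
cursor.execute(query)  # the multi-statement batch built above runs in a single session
conn.close()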
I'm setting up a Dataproc job to query some tables from BigQuery. While I am able to retrieve data from BigQuery tables, the same syntax does not work for retrieving data from an External Connection within my BigQuery project.
More specifically, I'm using the query below to retrieve event data from the analytics of my project:
PROJECT = ... # my project name
NUMBER = ... # my project's analytics number
DATE = ... # day of the events in the format YYYYMMDD
analytics_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.analytics_{NUMBER}.events_{DATE}') \
.load()
While the query above works perfectly, I am unable to query an external connection of my project. I'd like to be able to do something like:
DB_NAME = ... # my database name, considering that my Connection ID is
# projects/<PROJECT_NAME>/locations/us-central1/connections/<DB_NAME>
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('table', f'{PROJECT}.{DB_NAME}.my_table') \
.load()
Or even like this:
query = 'SELECT * FROM my_table'
my_table = spark.read \
.format('com.google.cloud.spark.bigquery') \
.option('query', query) \
.load()
How can I retrieve this data?
Thanks in advance :)
I am trying to write a Spark dataframe into an Azure Synapse database.
My code:
try:
    re_spdf.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", 'True') \
        .option("trustServerCertificate", 'false') \
        .option("hostNameInCertificate", '*.database.windows.net') \
        .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
        .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Error message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 29.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 29.0 (TID 885, 10.139.64.8, executor 0):
com.microsoft.sqlserver.jdbc.SQLServerException:
PdwManagedToNativeInteropException ErrorNumber: 46724, MajorCode: 467,
MinorCode: 24, Severity: 20, State: 2, Exception of type
'Microsoft.SqlServer.DataWarehouse.Tds.PdwManagedToNativeInteropException' was thrown.
I even googled this error message, but I didn't find any useful solution.
Update: My working environment is a Databricks PySpark notebook.
Any suggestions would be appreciated.
There is a column length limitation in the Synapse DB table: it will allow only 4000 characters.
So when I use com.databricks.spark.sqldw, which uses PolyBase as the connector, I need to change the length of the column in the DB table as well.
Reference: https://forums.databricks.com/questions/21032/databricks-throwing-error-sql-dw-failed-to-execute.html
code:
df.write \
.format("com.databricks.spark.sqldw") \
.mode("append") \
.option("url", url) \
.option("user", username) \
.option("password", password) \
.option("maxStrLength", "4000" ) \
.option("tempDir", "tempdirdetails") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
.option("dbTable", table_name) \
.save()
The Azure Databricks documentation says to use the format com.databricks.spark.sqldw to read/write data from/to an Azure Synapse table.
If you are using Synapse, why not use Synapse notebooks? Then writing the dataframe is as easy as calling synapsesql, e.g.
%%spark
df.write.synapsesql("yourPool.dbo.someXMLTable_processed", Constants.INTERNAL)
You would save yourself some trouble and performance should be good as it's parallelised. This is the main article:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
I want to select from a view that is only visible when the Oracle edition feature is activated:
alter session set EDITION=MYEDITION
view1
view1_edition1 => this view is only visible after the alter session statement above, which uses the Oracle edition feature.
In TOAD everything works fine: I run the alter session statement above and can then successfully select from that view.
I am trying to achieve the same in Spark, but it doesn't work; selecting from view1_edition1 returns "object does not exist".
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:oracle:thin:#db_server:1520/SERVICE") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("dbtable", "(select name from schema1.view1_edition1)") \
.option("user", "user") \
.option("password", "password") \
.option("sessionInitStatement","""alter session set EDITION=MYEDITION""") \
.load()
Just to prove that the edition is active, I run the following select through spark.read (in place of the dbtable value above) and it returns the correct edition as active:
(
  WITH a AS (SELECT name FROM SCHEMA1.TABLE1),
       b AS (SELECT SYS_CONTEXT('USERENV', 'SESSION_EDITION_NAME') AS edition FROM DUAL)
  SELECT name, edition
  FROM a
  CROSS JOIN b
)
Instead of alter session, try setting the edition name as a Java connection property before getting a connection from the Oracle JDBC data source, e.g.:
p.put("oracle.jdbc.editionName", "MyEdition");
...
ods.setConnectionProperties(p);
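If you would rather stay within spark.read, the same property can presumably be passed as an extra JDBC option, since Spark forwards options it does not recognize to the driver as connection properties. A sketch reusing the connection details from the question (the forwarding behaviour is an assumption worth verifying against your Spark and driver versions):

# Sketch: set the edition via a driver connection property instead of ALTER SESSION.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@db_server:1520/SERVICE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", "(select name from schema1.view1_edition1)") \
    .option("user", "user") \
    .option("password", "password") \
    .option("oracle.jdbc.editionName", "MYEDITION") \
    .load()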
I am using PySpark to load a CSV into Redshift. I want to query how many rows got added.
I create a new column using the withColumn function:
csvdata=df.withColumn("file_uploaded", lit("test"))
I see that this column gets created and I can query it using psql. But when I try to query it using the PySpark SQL context, I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling o77.showString.
: java.sql.SQLException: [Amazon](500310) Invalid operation: column "test" does not exist in billingreports;
Interestingly, I am able to query the other columns, just not the new column I added.
Appreciate any pointers on how to resolve this issue.
Complete code:
df=spark.read.option("header","true").csv('/mnt/spark/redshift/umcompress/' +
filename)
csvdata=df.withColumn("fileuploaded", lit("test"))
countorig=csvdata.count()
## This executes without error
csvdata.write \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("dbtable", dbname) \
.option("tempformat", "CSV") \
.option("tempdir", "s3://" + s3_bucket + "/temp") \
.mode("append") \
.option("aws_iam_role", iam_role).save()
select="select count(*) from " + dbname + " where fileuploaded='test'"
## Error occurs
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", jdbc_url) \
.option("query", select) \
.option("tempdir", "s3://" + s3_bucket + "/test") \
.option("aws_iam_role", iam_role) \
.load()
newcounnt=df.count()
Thanks for responding.
The dataframe does have the new column, called file_uploaded.
Here is the query:
select="select count(*) from billingreports where file_uploaded='test'"
I have printed the schema:
|-- file_uploaded: string (nullable = true)
df.show() shows that the new column is added.
I just want to add a pre determined string to this column as value.
Your dataframe csvdata will have a new column named file_uploaded, with the value "test" in all rows of df. The error shows that the query is trying to access a column named test, which does not exist in billingreports, hence the error. Print the schema before querying the column with billingreports.dtypes, or better, take a sample of your dataframe with billingreports.show() and check whether the column has the correct name and values.
It would be better if you shared the query that resulted in this exception, as the exception is thrown for the billingreports dataframe.
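For example, a quick check could look like this (a sketch; jdbc_url, s3_bucket and iam_role are the same variables used in the question):

# Load the target table back and inspect it before filtering on the new column.
billingreports = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbc_url) \
    .option("dbtable", "billingreports") \
    .option("tempdir", "s3://" + s3_bucket + "/temp") \
    .option("aws_iam_role", iam_role) \
    .load()

print(billingreports.dtypes)  # confirm the exact column name (file_uploaded vs fileuploaded)
billingreports.show(5)        # sample a few rows to check the values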
I want to write a Spark DataFrame to an Oracle table using the Oracle JDBC driver. My code is listed below:
url = "jdbc:oracle:thin:#servername:sid"
mydf.write \
.mode("overwrite") \
.option("truncate", "true") \
.format("jdbc") \
.option("url", url) \
.option("driver", "oracle.jdbc.OracleDriver") \
.option("createTableColumnTypes", "desc clob, price double") \
.option("user", "Steven") \
.option("password", "123456") \
.option("dbtable", "table1").save()
What I want is to specify the desc column as clob type and the price column as double precision type. But Spark shows me that the clob type is not supported. The length of the desc string is about 30K. I really need your help. Thanks
As this note specifies, there are some data types that are not supported. If the target table is already created with the CLOB data type, then createTableColumnTypes may be redundant. You can check whether writing to a CLOB column is possible with Spark JDBC if the table is already created.
Create your table in MySQL with your required schema, then use mode='append' and save the records.
mode='append' only inserts records, without modifying the table schema.
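A minimal sketch of that append approach for the Oracle question above, assuming table1 has already been created with the desired column types (e.g. desc CLOB and price DOUBLE PRECISION), so createTableColumnTypes is no longer needed:

# The table already defines the column types, so Spark only appends rows.
mydf.write \
    .mode("append") \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@servername:sid") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("user", "Steven") \
    .option("password", "123456") \
    .option("dbtable", "table1") \
    .save()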