I'm trying to load data from a Salesforce table to an ADLS path. To do this I'm using a SOQL-formatted query in the source dataset (Salesforce) of an ADF pipeline copy activity. Sample below.
SELECT DISTINCT col1, col2, col3, ... FROM table
This pipeline works for all tables except two, where it fails with a HybridDeliveryException (exact error below).
I also tried pulling only 10 rows; still no luck. But the same table works without any issues when selecting all columns: SELECT * FROM table
Any suggestions are greatly appreciated.
Error:
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared. ,Message=ERROR [HY000] [Microsoft][DSI] (20051) Internal error using swap file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" in "Simba::DSI::DiskSwapDevice::DoFlushBlock": "[Microsoft][Support] (40635) Simba::Support::BinaryFile: Write of 57168 bytes on file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" failed: No space left on device".,Source=Microsoft.DataTransfer.ClientLibrary.Odbc.OdbcConnector,''Type=System.Data.Odbc.OdbcException,Message=ERROR [HY000] [Microsoft][DSI] (20051) Internal error using swap file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" in "Simba::DSI::DiskSwapDevice::DoFlushBlock": "[Microsoft][Support] (40635) Simba::Support::BinaryFile: Write of 57168 bytes on file "D:\Users_azbatchtask_410\AppData\Local\Temp\a60f5b9a-da9c-47b3-9d03-14d64bf44dce.tmp" failed: No space left on device".,Source=Microsoft Salesforce ODBC Driver,'
This might not be a complete answer, but it may be helpful for someone as a workaround.
I ran some more tests today, and when I remove the keyword "distinct" from the SOQL statement, the query works fine with no exceptions this time.
It seems the issue occurs only with specific large tables.
The SOQL with distinct (SELECT DISTINCT col1, col2, col3, ...) still works fine for other, smaller tables.
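If the distinct rows are still needed downstream, one possible workaround (this assumes the copied data lands in ADLS and is post-processed with Spark, which is not stated in the question; the paths and column names below are placeholders) is to drop DISTINCT from the SOQL and deduplicate after the copy:
# Hypothetical post-processing step: deduplicate after the ADF copy instead of
# using DISTINCT in the SOQL (paths and column names are placeholders).
raw = spark.read.parquet("abfss://container@account.dfs.core.windows.net/raw/table")
deduped = raw.select("col1", "col2", "col3").dropDuplicates()
deduped.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/curated/table")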
I am trying to extract data from SAP using the SAP CDC connector in ADF. The source data looks something like this:
START_DT|PROD_NAME|END_DT
20201230165830.0|BBEESABX|20180710143703.0
When we preview data on the source, we get data just like the above. But when copying via the copy activity, the failure below is observed:
Failure happened on 'Source' side. ErrorCode=SapParsingDataFailure,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed when parsing data, parsing value: 'ESABX 201807', expected data type 'Microsoft.DataTransfer.Common.Shared.ClrTypeCode'.Please check your origin data in SAP side,Source=Microsoft.DataTransfer.Runtime.SapRfcHelper,''Type=System.FormatException,Message=Input string was not in a correct format.,Source=mscorlib,'
I have tried several combinations and changes on the sink side, such as changing Parquet to CSV and changing Copy behavior to all available options, but nothing seems to work.
Do you perhaps have hidden fields in the SAP extractor (RSA6)? Try this workaround: select all fields in the SAP CDC connector and run it again.
Is there any reason this command works well:
%sql SELECT * FROM Azure.Reservations WHERE timestamp > '2021-04-02'
returning 2 rows, while the one below:
%sql DELETE FROM Azure.Reservations WHERE timestamp > '2021-04-02'
fails with:
Error in SQL statement: AssertionError: assertion failed: No plan for
DeleteFromTable (timestamp#394 > 1617321600000000)
?
I'm new to Databricks, but I'm sure I ran a similar command on another table (without a WHERE clause). The table is created based on a Parquet file.
DELETE FROM (and similarly UPDATE or MERGE) isn't supported on Parquet files; right now on Databricks it's supported for the Delta format. You can convert your Parquet files into Delta using CONVERT TO DELTA, and then this command will work for you.
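For example, a minimal sketch of the conversion run from a Python cell on Databricks (this assumes Azure.Reservations is an unpartitioned Parquet table; a partitioned table needs a PARTITIONED BY clause):
# Convert the Parquet table in place to Delta; DELETE then becomes available.
spark.sql("CONVERT TO DELTA Azure.Reservations")
spark.sql("DELETE FROM Azure.Reservations WHERE timestamp > '2021-04-02'")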
Another alternative is to read the Parquet files, keep only the rows you want, and overwrite your Parquet files.
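A sketch of that approach (the path and predicate are placeholders; writing to a temporary location avoids overwriting the files Spark is still lazily reading):
# Keep only the rows that should survive and rewrite them elsewhere,
# then move the new files over the original path.
df = spark.read.parquet("/mnt/path/to/reservations")      # placeholder path
kept = df.where("timestamp <= '2021-04-02'")              # rows to keep
kept.write.mode("overwrite").parquet("/mnt/path/to/reservations_tmp")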
It could be that you are trying to DELETE from a VIEW (in case it is not a Parquet file).
Unfortunately, there is no easy way to differentiate between a VIEW and a TABLE in Databricks; the only way you can test whether it's indeed a view is:
SHOW VIEWS FROM Azure like 'reser*'
or, if it's a table:
SHOW TABLES FROM Azure like 'reser*'
Show tables syntax
Show views syntax
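The same check can also be done from Python with the catalog API; a small sketch (database name taken from the question; the tableType values shown are what listTables reports):
# List objects in the database and show which are views and which are tables.
for t in spark.catalog.listTables("Azure"):
    print(t.name, t.tableType)   # e.g. 'MANAGED', 'EXTERNAL' or 'VIEW'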
Just delete from the Delta path:
%sql
delete from delta.`/mnt/path`
where x
I have a small log dataframe which holds metadata about the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures, so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job',mode='append')
This works swimmingly. However, as my Data Factory is invoking a stored procedure,
I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
[MaxIngestionDate] date {date}
, [DataLakeMetadataRaw] varchar(MAX) NULL
, [DataLakeMetadataCurated] varchar(MAX) NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, it only seems to mention using SELECT statements with the query parameter.
The target database is an Azure SQL Database.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Just to add: this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with DataFrames; you can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (see "How to install PYODBC in Databricks"), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
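A minimal pyodbc sketch of that UPDATE (the driver name, server, database, and credentials are placeholders; the MaxIngestionDate value is passed as a parameter rather than formatted into the string):
import pyodbc
from datetime import date

# Placeholder connection details for the Azure SQL Database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()
cursor.execute(
    """UPDATE [internal].[Job]
       SET [MaxIngestionDate] = ?,
           [DataLakeMetadataRaw] = NULL,
           [DataLakeMetadataCurated] = NULL
       WHERE [IsRunning] = 1 AND [FinishDateTime] IS NULL""",
    (date.today(),),  # placeholder for the ingestion date computed in the notebook
)
conn.commit()
conn.close()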
I have a table in Hive:
db.table_name
When I run the following in Hive, I get results back:
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns before the show displays nothing either, and performing a count reports that the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the Parquet files directly, it works:
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently working around this by describing the table to get its location and then using spark.read.parquet.
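For reference, a sketch of that workaround in PySpark (the 'Location' row name matches the DESCRIBE FORMATTED output I'm assuming here; verify it against your Spark version):
# Look up the table's storage location, then read the Parquet files directly.
loc = (spark.sql("DESCRIBE FORMATTED db.table_name")
       .where("col_name = 'Location'")
       .collect()[0]["data_type"])
spark.read.parquet(loc).show()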
Have you refreshed the table metadata? Maybe you need to refresh the table to access the new data:
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as a text file.
There is probably some incompatibility with the Hive Parquet format.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then loading the data from a temporary table or via a query.
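A rough sketch of that recommendation, written in PySpark (the table name comes from the question; the schema and staging view name are placeholders):
# Create the Hive table up front instead of letting saveAsTable define it,
# then load it from a temporary view (placeholder schema and view name).
query_result.createOrReplaceTempView("staging_result")
spark.sql("""
    CREATE TABLE IF NOT EXISTS db.table_name (col1 STRING, col2 INT)
    STORED AS PARQUET
""")
spark.sql("INSERT OVERWRITE TABLE db.table_name SELECT col1, col2 FROM staging_result")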
I have PySpark code which writes into a SQL Server database like this:
df.write.jdbc(url=url, table="AdventureWorks2012.dbo.people", properties=properties)
However, the problem is that I want to keep writing to the table people even if the table exists. I see in the Spark documentation that the possible mode options are error, append, overwrite, and ignore, yet all of them throw an error that the object already exists if the table already exists in the database.
Spark throws the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.jdbc.
com.microsoft.sqlserver.jdbc.SQLServerException: There is already an object named 'people' in the database
Is there a way to write data into the table even if the table already exists?
Please let me know if you need more explanation.
For me the issue was with Spark 1.5.2. The way it checks whether the table exists (here) is by running SELECT 1 FROM $table LIMIT 1. If that query fails, Spark concludes the table doesn't exist. That query failed even when the table was there, because LIMIT is not valid T-SQL; Spark then assumed the table was missing, tried to create it, and hit the "already an object" error.
This was changed to SELECT * FROM $table WHERE 1=0 in 1.6.0 (here).
So the append and overwrite modes will not throw an error when the table already exists. From the Spark documentation (http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes): SaveMode.Append means that "when saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data", and SaveMode.Overwrite means that "if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame". Depending on how you want to handle the existing table, one of these two should meet your needs.
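For example, a small sketch using the setup from the question (mode="append" adds rows to the existing table; swap in "overwrite" to replace its contents):
# Append to the existing SQL Server table instead of erroring when it exists.
df.write.jdbc(
    url=url,
    table="AdventureWorks2012.dbo.people",
    mode="append",        # or "overwrite" to replace existing data
    properties=properties,
)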