Double values getting converted to exponential notation when inserting data from Azure Databricks into an Azure SQL database

I'm trying to load data from Azure Databricks into an Azure SQL database table via JDBC. The data loads fine, but double values from Azure Databricks get converted to exponential notation when inserted into the SQL table. I have tried different data types in the SQL database such as nvarchar, varchar, and float, and the values still get converted to exponential notation.
However, when I try the decimal data type in the Azure SQL database, it loads the data into the column without exponential notation, but gives me extra zeros at the end.
The command that I'm using in Databricks is:
%scala
spark.sql("select ID from customers")
  .write
  .mode(SaveMode.Append) // <--- Append to the existing table
  .jdbc(jdbcUrl, "stg.customers", connectionProperties)
Some examples of the values stored in the Azure Databricks ID column are:
ID
1900845009567889.12
2134012183812321
When using numeric(38,15) data type in Azure SQL Database it is giving me the following output:
|ID|
|:--|
|1900845009567889.1200000000000000|
|2134012183812321.0000000000000000|
I don't want the extra zeros at the end. Also, the data in the Databricks table is not properly defined, so I cannot say whether numeric(38,15) would suffice or not.
I also tried storing the data in the Azure Databricks ID column as a String datatype and then loading that into a varchar or nvarchar column in the SQL table. But it still converts the data into exponential notation.
Can anyone please suggest if there is any easy way to load this data from Azure Databricks to Azure SQL database?

I cannot say if numeric (38,15) would suffice or not
Before SQL Server 2016 (13.x), conversion of float values to decimal or numeric was restricted to a precision of 17 digits.
That restriction no longer applies starting with SQL Server 2016 (13.x).
Generic example:
Below is a simplified example from the Microsoft documentation that shows how values are stored in decimal and numeric columns.
CREATE TABLE dbo.MyTable
(
    DecimalColumn DECIMAL(5,2),
    NumericColumn NUMERIC(10,5)
);
GO
INSERT INTO dbo.MyTable VALUES (123, 12345.12);
GO
SELECT DecimalColumn, NumericColumn FROM dbo.MyTable;
GO
Result for the above SQL query:
|DecimalColumn|NumericColumn|
|:--|:--|
|123.00|12345.12000|
I don't want the extra zeros at the end.
In SQL Server, we can use the float data type to drop the extra zeros at the end (the scale padding).
To do this, cast the value to float for display purposes only:
SELECT CAST(12345.1200000 as float)
Output:
12345.12
See the reference above for excluding the extra zeros.
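If it helps, here is a minimal T-SQL sketch of applying that cast-for-display idea to the table from the question (stg.customers and its ID column are taken from the question; keeping the stored column as decimal/numeric preserves the exact value):
-- Store the exact value in a decimal/numeric column and cast to float
-- only in the SELECT used for display, so the trailing zeros disappear.
-- Note: float carries roughly 15 significant digits, so very wide values
-- such as 1900845009567889.12 may lose their final digits in the cast.
SELECT CAST(ID AS float) AS ID
FROM stg.customers;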

Related

Incremental load in Azure Data Factory

I am replicating my data from Azure SQL DB to Azure SQL DB. I have some tables with date columns and some tables with just ID columns that are assigned as the primary key. While performing incremental load in ADF, I can select the date as the watermark column for the tables that have a date column, and the ID as the watermark column for the tables that have an ID column. But the issue is that my ID holds GUID values, so can I take that as my watermark column? And if yes, the copy activity then gives me the following error in ADF.
Please see the image above for reference.
How can I overcome this issue? Help is appreciated.
Thank you,
Gp
I have tried dynamic mapping from here: https://martinschoombee.com/2022/03/22/dynamic-column-mapping-in-azure-data-factory/ but it does not work; it still gives me the same error.
Regarding your question about the watermark:
A watermark is a column that has the last updated time stamp or an incrementing key
So a GUID column would not be a good fit.
Try to find a date column, or an ever-incrementing integer identity, to use as the watermark.
Since your source is SQL Server, you can also use change data capture.
Links:
Incremental loading in ADF
Change data capture
Regards,
Chen
The watermark logic takes advantage of the fact that only the records inserted after the last saved watermark need to be considered for copying from source A to B; basically, we are using the ">=" operator to our advantage here.
In the case of a GUID you cannot use that logic: a GUID is certainly unique, but ">=" or "<=" comparisons against it will not work for detecting new rows.
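A minimal T-SQL sketch of that watermark pattern (dbo.SourceTable, LastModifiedDate, and dbo.WatermarkTable are hypothetical names used only for illustration):
-- 1. Read the watermark saved by the previous run.
DECLARE @LastWatermark datetime2 =
    (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SourceTable');

-- 2. Copy only the rows changed since then; this is the ">=" comparison
--    that a GUID column cannot meaningfully support.
SELECT *
FROM dbo.SourceTable
WHERE LastModifiedDate >= @LastWatermark;

-- 3. Save the new high-water mark for the next run.
UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(LastModifiedDate) FROM dbo.SourceTable)
WHERE TableName = 'SourceTable';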

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the 'Timestamp type source attribute' and load it into a dedicated SQL pool table (time datatype column). But I don't find a time function within the expression builder in ADF; is there a way I can do it?
- What did I do?
- I took the time part from the source attribute using substring and then tried to load it into the destination table. When I did, the destination table got null values inserted, because the column at the destination is set to the time datatype.
I tried to reproduce this and got the same issue. The following is a demonstration of the same. I have a table called mydemo as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized datatype in Azure dataflow (date and timestamp are accepted). Therefore, the dataflow fails to convert the string (substring(<timestamp_col>,12,5)) into the time type.
For a better understanding, you can load your sink table as a source in the dataflow. The time column will be read as 1900-01-01 12:34:56 when the time value in the table row is 12:34:56.
-- my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5) to return 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)) which returns 1900-01-01 00:01:00.
Configure the sink and mapping, and look at the resulting data in the data preview. Now the Azure dataflow will be able to successfully insert the values and give the desired results.
The following is the output after successful insertion of the record into the dedicated pool table.
NOTE: You can construct valid yyyy-MM-dd hh:mm:ss as a value using concat('yyyy-MM-dd ',substring(<timestamp_col>,12,8)) in place of 1900-01-01 hh:mm:ss in derived column transformation.
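As a quick sanity check of that behaviour, a value shaped like 1900-01-01 hh:mm:ss also converts cleanly when inserted straight into the time column of the mydemo table above; only the time portion is kept (the id value 201 is just an assumed example):
-- Insert a value in the '1900-01-01 hh:mm:ss' shape produced by the derived column.
INSERT INTO [dbo].[mydemo] VALUES (201, '2022-08-18', '1900-01-01 12:34:56');

-- The time column keeps only the time part.
SELECT my_time FROM [dbo].[mydemo] WHERE id = 201;   -- returns 12:34:56.0000000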

spark thrift server issue: Length field is empty for varchar fields

I am trying to read data from Spark Thrift Server using SAS. In the table definition through DBeaver, I can see that the Length field is empty only for fields with the VARCHAR data type. I can see the length in the Data Type field as varchar(32), but that doesn't serve my purpose, as the SAS application taps into the Length field. Since this field is not populated, SAS defaults to the maximum size and as a result it becomes extremely slow. I do get the Length field populated in Hive.

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL DB or an Oracle database, and I want to pull the table data into an Azure SQL database. But the problem is that I don't have any date column that records when data is inserted, nor a primary key column. So is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load you store the comparison output in the partition metadata somewhere, so you can compare against it again the next time you load. That way you can reload only the partitions that have changed since your last load.
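A minimal T-SQL sketch of such a per-partition fingerprint, assuming the source is SQL Server (dbo.SourceTable and PartitionColumn are hypothetical names):
-- One row per partition: compare RowCnt and PartitionChecksum with the values
-- saved from the previous load to decide which partitions need to be reloaded.
SELECT PartitionColumn,
       COUNT(*)                  AS RowCnt,
       CHECKSUM_AGG(CHECKSUM(*)) AS PartitionChecksum
FROM dbo.SourceTable
GROUP BY PartitionColumn;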

implementing scd type 2 as a generalized stored procedure in azure data warehouse for all dimensions?

I am new to Azure and I am working on Azure data warehouse. I have loaded a few dimensions and staging tables. I want to implement SCD type 2 as a generalised procedure for all the updates, using hashbytes. As we know, ADW doesn't support MERGE, so I am trying to implement this with normal insert and update statements with a startdate and enddate column. But the schemas of the dimension tables are not exactly the same as those of the staging tables; there are a few columns that are not considered.
Initially I thought I would pass in the staging and dimension tables as parameters, fetch the schema from sys objects, create a temp table, load the necessary columns from staging, compute a hashbytes value, and compare the hash between the temp table and the dimension. But is this a good approach?
PS: One more problem is that sometimes the column names are mapped differently, like branchid as branch_id. How do I fetch the columns in these cases? Note that this is just one case, and this could be the case in many tables as well.
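For reference, a minimal sketch of the insert/update SCD type 2 pattern described above for a single dimension (stg.Customer, dbo.DimCustomer, and all column names are hypothetical; a generalised procedure would build these statements from metadata, and depending on the service's T-SQL surface the joined UPDATE may need to be rewritten, for example via CTAS):
-- 1. Expire the current dimension row when the incoming hash differs.
UPDATE d
SET d.EndDate = GETDATE(),
    d.IsCurrent = 0
FROM dbo.DimCustomer d
JOIN stg.Customer s
    ON s.CustomerId = d.CustomerId
WHERE d.IsCurrent = 1
  AND d.RowHash <> HASHBYTES('SHA2_256', CONCAT_WS('|', s.Name, s.City));

-- 2. Insert a new current row for every changed or brand-new business key.
INSERT INTO dbo.DimCustomer (CustomerId, Name, City, RowHash, StartDate, EndDate, IsCurrent)
SELECT s.CustomerId, s.Name, s.City,
       HASHBYTES('SHA2_256', CONCAT_WS('|', s.Name, s.City)),
       GETDATE(), NULL, 1
FROM stg.Customer s
LEFT JOIN dbo.DimCustomer d
    ON d.CustomerId = s.CustomerId AND d.IsCurrent = 1
WHERE d.CustomerId IS NULL;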
