I'm curious to find out the best way to generate relationship identities through Azure Data Factory (ADF).
Right now, I'm consuming JSON data that does not have any identity information. This data is then transformed into multiple database sink tables with relationships (1..n, etc.). Due to FK constraints on some of the destination sink tables, these relationships need to be "built up" one at a time.
This approach seems a bit kludgy, so I'm looking to see if there are other options that I'm not aware of.
Note that I need to include the surrogate key generation for each insert. If I do not, then given the output database schema, I'll get a 'cannot insert PK null' error.
Also note that I turn IDENTITY_INSERT ON/OFF for each sink.
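For reference, the per-sink toggle is just the standard T-SQL pattern in each sink's pre/post scripts, something like this sketch (dbo.SomeSink is an illustrative table name):
-- Sketch: per-sink toggle so explicit surrogate key values can be inserted
SET IDENTITY_INSERT dbo.SomeSink ON;
-- ... rows with explicit surrogate key values are inserted here ...
SET IDENTITY_INSERT dbo.SomeSink OFF;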
I would tend to take more of an ELT approach and use the native JSON abilities in Azure SQL DB, i.e. OPENJSON. You could land the JSON in a table in Azure SQL DB using ADF (e.g. a Stored Proc activity) and then call another stored proc to process the JSON, something like this:
-- Setup
DROP TABLE IF EXISTS #tmp
DROP TABLE IF EXISTS import.City;
DROP TABLE IF EXISTS import.Region;
DROP TABLE IF EXISTS import.Country;
GO
DROP SCHEMA IF EXISTS import
GO
CREATE SCHEMA import
CREATE TABLE Country ( CountryKey INT IDENTITY PRIMARY KEY, CountryName VARCHAR(50) NOT NULL UNIQUE )
CREATE TABLE Region ( RegionKey INT IDENTITY PRIMARY KEY, CountryKey INT NOT NULL FOREIGN KEY REFERENCES import.Country, RegionName VARCHAR(50) NOT NULL UNIQUE )
CREATE TABLE City ( CityKey INT IDENTITY(100,1) PRIMARY KEY, RegionKey INT NOT NULL FOREIGN KEY REFERENCES import.Region, CityName VARCHAR(50) NOT NULL UNIQUE )
GO
DECLARE @json NVARCHAR(MAX) = '{
"Cities": [
{
"Country": "England",
"Region": "Greater London",
"City": "London"
},
{
"Country": "England",
"Region": "West Midlands",
"City": "Birmingham"
},
{
"Country": "England",
"Region": "Greater Manchester",
"City": "Manchester"
},
{
"Country": "Scotland",
"Region": "Lothian",
"City": "Edinburgh"
}
]
}'
SELECT *
INTO #tmp
FROM OPENJSON( @json, '$.Cities' )
WITH
(
Country VARCHAR(50),
Region VARCHAR(50),
City VARCHAR(50)
)
GO
-- Add the Country first (has no foreign keys)
INSERT INTO import.Country ( CountryName )
SELECT DISTINCT Country
FROM #tmp s
WHERE NOT EXISTS ( SELECT * FROM import.Country t WHERE s.Country = t.CountryName )
-- Add the Region next including Country FK
INSERT INTO import.Region ( CountryKey, RegionName )
SELECT t.CountryKey, s.Region
FROM #tmp s
INNER JOIN import.Country t ON s.Country = t.CountryName
-- Now add the City with FKs
INSERT INTO import.City ( RegionKey, CityName )
SELECT r.RegionKey, s.City
FROM #tmp s
INNER JOIN import.Country c ON s.Country = c.CountryName
INNER JOIN import.Region r ON s.Region = r.RegionName
AND c.CountryKey = r.CountryKey
SELECT * FROM import.City;
SELECT * FROM import.Region;
SELECT * FROM import.Country;
This is a simple test script designed to show the idea; it should run end-to-end, but it is not production code.
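If you rerun the script or load overlapping batches, the Region and City inserts would also need to be made idempotent, following the same NOT EXISTS pattern as the Country insert. A sketch for Region (City would be analogous):
-- Sketch: a rerunnable version of the Region insert, mirroring the Country pattern
INSERT INTO import.Region ( CountryKey, RegionName )
SELECT DISTINCT t.CountryKey, s.Region
FROM #tmp s
INNER JOIN import.Country t ON s.Country = t.CountryName
WHERE NOT EXISTS ( SELECT * FROM import.Region r WHERE s.Region = r.RegionName )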
I'm working off the example from the jooq blog post: https://blog.jooq.org/jooq-3-15s-new-multiset-operator-will-change-how-you-think-about-sql/. I've got the same setup: a foreign key relationship, and I want to load the Parent table and also get all rows that reference each of the Parent rows:
CREATE TABLE parent (
id BIGINT NOT NULL,
CONSTRAINT pk_parent PRIMARY KEY (id)
);
CREATE TABLE item (
id BIGINT NOT NULL,
parent_id BIGINT NOT NULL,
type VARCHAR(255) NOT NULL,
CONSTRAINT pk_item PRIMARY KEY (id),
FOREIGN KEY (parent_id) REFERENCES parent (id)
);
This is what I think the jooq query should look like:
@Test
public void test() {
    dslContext.insertInto(PARENT, PARENT.ID).values(123L).execute();
    dslContext.insertInto(PARENT, PARENT.ID).values(456L).execute();
    dslContext.insertInto(ITEM, ITEM.ID, ITEM.PARENT_ID, ITEM.TYPE).values(1L, 123L, "t1").execute();
    dslContext.insertInto(ITEM, ITEM.ID, ITEM.PARENT_ID, ITEM.TYPE).values(2L, 456L, "t2").execute();

    var result = dslContext.select(
            PARENT.ID,
            DSL.multiset(
                DSL.select(
                        ITEM.ID,
                        ITEM.PARENT_ID,
                        ITEM.TYPE)
                    .from(ITEM)
                    .join(PARENT).onKey()))
        .from(PARENT)
        .fetch();

    System.out.println(result);
}
The result is that each Item shows up for each Parent:
Executing query : select "PUBLIC"."PARENT"."ID", (select coalesce(json_arrayagg(json_array("v0", "v1", "v2" null on null)), json_array(null on null)) from (select "PUBLIC"."ITEM"."ID" "v0", "PUBLIC"."ITEM"."PARENT_ID" "v1", "PUBLIC"."ITEM"."TYPE" "v2" from "PUBLIC"."ITEM" join "PUBLIC"."PARENT" on "PUBLIC"."ITEM"."PARENT_ID" = "PUBLIC"."PARENT"."ID") "t") from "PUBLIC"."PARENT"
Fetched result : +----+----------------------------+
: | ID|multiset |
: +----+----------------------------+
: | 123|[(1, 123, t1), (2, 456, t2)]|
: | 456|[(1, 123, t1), (2, 456, t2)]|
: +----+----------------------------+
Fetched row(s) : 2
I also tried doing an explicit check for Parent.id == Item.parent_id, but it didn't generate valid SQL:
var result =
    dslContext
        .select(
            PARENT.ID,
            DSL.multiset(
                DSL.select(ITEM.ID, ITEM.PARENT_ID, ITEM.TYPE)
                    .from(ITEM)
                    .where(ITEM.PARENT_ID.eq(PARENT.ID))))
        .from(PARENT)
        .fetch();
Error:
jOOQ; bad SQL grammar [select "PUBLIC"."PARENT"."ID", (select coalesce(json_arrayagg(json_array("v0", "v1", "v2" null on null)), json_array(null on null)) from (select "PUBLIC"."ITEM"."ID" "v0", "PUBLIC"."ITEM"."PARENT_ID" "v1", "PUBLIC"."ITEM"."TYPE" "v2" from "PUBLIC"."ITEM" where "PUBLIC"."ITEM"."PARENT_ID" = "PUBLIC"."PARENT"."ID") "t") from "PUBLIC"."PARENT"]
org.springframework.jdbc.BadSqlGrammarException: jOOQ; bad SQL grammar [select "PUBLIC"."PARENT"."ID", (select coalesce(json_arrayagg(json_array("v0", "v1", "v2" null on null)), json_array(null on null)) from (select "PUBLIC"."ITEM"."ID" "v0", "PUBLIC"."ITEM"."PARENT_ID" "v1", "PUBLIC"."ITEM"."TYPE" "v2" from "PUBLIC"."ITEM" where "PUBLIC"."ITEM"."PARENT_ID" = "PUBLIC"."PARENT"."ID") "t") from "PUBLIC"."PARENT"]
at org.jooq_3.17.6.H2.debug(Unknown Source)
What am I doing wrong here?
Correlated derived table support
There are numerous SQL dialects that can emulate MULTISET in principle, but not if you correlate them like you did. According to #12045, these dialects do not support correlated derived tables:
Db2
H2
MariaDB (see https://jira.mariadb.org/browse/MDEV-28196)
MySQL 5.7 (8 can handle it)
Oracle 11g (12c can handle it)
#12045 was fixed in jOOQ 3.18, producing a slightly less robust and more limited MULTISET emulation that only works in the absence of:
DISTINCT
UNION (and other set operations)
OFFSET .. FETCH
GROUP BY and HAVING
WINDOW and QUALIFY
But that probably doesn't affect 95% of all MULTISET usages.
Workarounds
You could use MULTISET_AGG, which doesn't suffer from this limitation (but is generally less powerful); a rough sketch of the SQL shape it produces follows this list
You could stop using H2, in case you're using that only as a test database (jOOQ recommends integration testing directly against your target database; this is a prime example of why that is generally better)
You could upgrade to 3.18.0-SNAPSHOT for the time being (built off GitHub, or available from here, if you're licensed: https://www.jooq.org/download/versions)
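To see why aggregation sidesteps the limitation, here is a rough plain-SQL sketch of the shape a MULTISET_AGG query renders to, based on the json_arrayagg emulation visible in the logged queries above (exact rendering varies by dialect and version, and a real emulation also wraps the aggregate to handle empty results):
-- Rough sketch only: the aggregation shape needs just an ordinary join
-- plus GROUP BY, so no correlated derived table is required.
SELECT
  PARENT.ID,
  JSON_ARRAYAGG(JSON_ARRAY(ITEM.ID, ITEM.PARENT_ID, ITEM.TYPE)) AS items
FROM PARENT
LEFT JOIN ITEM ON ITEM.PARENT_ID = PARENT.ID
GROUP BY PARENT.ID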
I have two kinds of records, shown below, in my table staudentdetail in Cosmos DB. In the example below, previousSchooldetail is a nullable field and may or may not be present for a student.
Sample records below:
{
"empid": "1234",
"empname": "ram",
"schoolname": "high school ,bankur",
"class": "10",
"previousSchooldetail": {
"prevSchoolName": "1763440",
"YearLeft": "2001"
} --(Nullable)
}
{
"empid": "12345",
"empname": "shyam",
"schoolname": "high school",
"class": "10"
}
I am trying to access the above records from Azure Databricks using PySpark or Scala code. But when we build the DataFrame by reading from Cosmos DB, it does not bring previousSchooldetail into the DataFrame. When we change the query to include an id for which previousSchooldetail exists, it does show up in the DataFrame.
Case 1:
val Query = "SELECT * FROM c "
Result when the query is fired directly (columns returned): empid, empname, schoolname, class
Case 2:
val Query = "SELECT * FROM c where c.empid=1234"
Result when the query is fired with the where clause (columns returned): empid, empname, schoolname, class, previousSchooldetail (prevSchoolName, YearLeft)
Could you please tell me why I am not able to get previousSchooldetail in Case 1, and how I should proceed?
As @Jayendran mentioned in the comments, the first query will give you the previousSchooldetail document wherever it is available; otherwise, the column will not be present.
You can have this column present in all scenarios by using the IS_DEFINED function. Try tweaking your query as below:
SELECT c.empid,
c.empname,
IS_DEFINED(c.previousSchooldetail) ? c.previousSchooldetail : null
as previousSchooldetail,
c.schoolname,
c.class
FROM c
If you are looking to get the result as a flat structure, it can be tricky, and you would need to use two separate queries, such as:
Query 1
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
p.prevSchoolName,
p.YearLeft
FROM c JOIN c.previousSchooldetail p
Query 2
SELECT c.empid,
c.empname,
c.schoolname,
c.class,
null as prevSchoolName,
null as YearLeft
FROM c
WHERE not IS_DEFINED (c.previousSchooldetail) or
c.previousSchooldetail = null
Unfortunately, Cosmos DB does not support LEFT JOIN or UNION. Hence, I'm not sure if you can achieve this in a single query.
Alternatively, you can create a stored procedure to return the desired result.
I'm gonna guess no, but secondary indexes seem a lot like tables, in that you can directly select from them with FORCE_INDEX and even JOIN on them:
JOIN MyTable@{FORCE_INDEX=anIndexToUseFromMyTable} AS myTable
So maybe you can create a new table interleaved into an index?
Example
CREATE TABLE Foo (
primaryId STRING(64) NOT NULL,
secondaryId STRING(64) NOT NULL,
modifiedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (primaryId);
-- Index we would like to interleave into for another table
CREATE INDEX FooSecondaryIdIndex ON Foo(secondaryId);
-- interleave this table into the index above
-- and support DELETE CASCADE
CREATE TABLE Bar (
secondaryId STRING(64) NOT NULL,
extraData STRING(64) NOT NULL,
modifiedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (secondaryId),
INTERLEAVE IN PARENT Foo@{FORCE_INDEX=FooSecondaryIdIndex} ON DELETE CASCADE;
Well... it doesn’t look like that is supported:
Error parsing Spanner DDL statement: CREATE TABLE Bar ( secondaryId STRING(64) NOT NULL, extraData STRING(64) NOT NULL, modifiedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true), ) PRIMARY KEY (secondaryId), INTERLEAVE IN PARENT Foo@{FORCE_INDEX=FooSecondaryIdIndex} ON DELETE CASCADE : Syntax error on line 6, column 25: Expecting 'EOF' but found '@'
I have the following JSON format to store lawyers, and I have doubts about how to model the "specialties" field in Postgres: it is an array of objects, each with a title and a subarray of sub-specialties:
{
"id": 1,
"name": "John Johnson Johannes",
"gender": "f",
"specialties": [
{
"specialty": "Business law",
"sub-specialties": [
"Incorporation",
"Taxes",
"Fusions"
]
},
{
"specialty": "Criminal law",
"sub-specialties": [
"Property offenses",
"Personal offenses",
"Strict liability"
]
}
]
}
And I have made this lawyers table in Postgres:
DROP DATABASE IF EXISTS lawyers_db;
CREATE DATABASE lawyers_db;
\c lawyers_db;
CREATE TYPE gen AS ENUM ('f', 'm');
CREATE TABLE lawyers_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR,
gender gen
);
INSERT INTO lawyers_tb (name, gender)
VALUES ('John Doe', 'm');
I'm using some Node.js libraries that return the data as JSON when I read from a Postgres table, so I would like to keep the relational model without using JSONB to store my lawyers as documents.
Is it possible to achieve what I want without using the JSONB type?
Forget about objects for a minute and really think through what your data are and how they relate to each other (we are, after all, using a relational database).
What you have here is simply a relationship.
You have lawyers and you have specialties. The relationship is that lawyers have specialties and specialties belong to lawyers (an n-to-n relationship) and the same goes for the relationship between specialties and subspecialties (n-to-n).
First, let's do the simpler structure of a 1-to-n relationship:
CREATE TABLE lawyers_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR,
gender gen
);
CREATE TABLE specialties_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR,
lawyer_ID INTEGER
);
CREATE TABLE subspecialties_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR,
specialty_ID INTEGER
);
This works, but it results in duplicates: because each specialty can only belong to one lawyer, if two lawyers specialise in "Business law" you'd have to define "Business law" twice. Worse, for each duplicated specialty you'd also have to duplicate its subspecialties.
The solution is a join table (also called a map/mapping table):
CREATE TABLE lawyers_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR,
gender gen
);
CREATE TABLE lawyer_specialties_tb (
name VARCHAR,
lawyer_ID INTEGER,
specialty_ID INTEGER
);
CREATE TABLE specialties_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR
);
CREATE TABLE specialty_subspecialties_tb (
name VARCHAR,
specialty_ID INTEGER,
subspecialty_ID INTEGER
);
CREATE TABLE subspecialties_tb (
ID SERIAL PRIMARY KEY,
name VARCHAR
);
This way each specialty can belong to more than one lawyer (true n-to-n relationship) and each subspecialty can belong to more than one specialty.
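For illustration, loading the sample data from the question into this structure might look like the sketch below (it assumes the SERIAL keys start at 1, so lawyer 1 is John and specialties 1 and 2 are Business law and Criminal law):
-- Sketch: loading the question's sample data (assumes SERIAL keys start at 1)
INSERT INTO lawyers_tb (name, gender) VALUES ('John Johnson Johannes', 'f');
INSERT INTO specialties_tb (name) VALUES ('Business law'), ('Criminal law');
INSERT INTO subspecialties_tb (name)
VALUES ('Incorporation'), ('Taxes'), ('Fusions'),
       ('Property offenses'), ('Personal offenses'), ('Strict liability');
INSERT INTO lawyer_specialties_tb (lawyer_ID, specialty_ID) VALUES (1, 1), (1, 2);
INSERT INTO specialty_subspecialties_tb (specialty_ID, subspecialty_ID)
VALUES (1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (2, 6);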
You can use joins to fetch the whole dataset:
SELECT lawyers_tb.name as name,
lawyers_tb.gender as gender,
specialties_tb.name as specialty,
subspecialties_tb.name as subspecialty
FROM lawyers_tb LEFT JOIN lawyer_specialties_tb
ON lawyers_tb.ID=lawyer_specialties_tb.lawyer_ID
LEFT JOIN specialties_tb
ON specialties_tb.ID=lawyer_specialties_tb.specialty_ID
LEFT JOIN specialty_subspecialties_tb
ON specialties_tb.ID=specialty_subspecialties_tb.specialty_ID
Yes, it's a bit more complicated to query, but the structure allows you to maintain each dataset individually and defines the proper relationships between them.
You may also want to define the keys in the join tables as foreign keys to enforce correctness of the dataset.
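For example, a sketch of those foreign keys (constraint names are illustrative):
-- Sketch: enforcing referential integrity on the join tables
ALTER TABLE lawyer_specialties_tb
  ADD CONSTRAINT fk_ls_lawyer FOREIGN KEY (lawyer_ID) REFERENCES lawyers_tb (ID),
  ADD CONSTRAINT fk_ls_specialty FOREIGN KEY (specialty_ID) REFERENCES specialties_tb (ID);
ALTER TABLE specialty_subspecialties_tb
  ADD CONSTRAINT fk_ss_specialty FOREIGN KEY (specialty_ID) REFERENCES specialties_tb (ID),
  ADD CONSTRAINT fk_ss_subspecialty FOREIGN KEY (subspecialty_ID) REFERENCES subspecialties_tb (ID);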
Here's the issue. I have 2 tables that I am currently using in a pivot to return a single value, MAX(Date). I have been asked to return additional values associated with that particular MAX(Date). I know I can do this with an OVER PARTITION, but it would require me to do about 8 or 9 LEFT JOINs to get the desired output. I was hoping there is a way to get my existing PIVOT to return these values. More specifically, let's say each MAX(Date) has a data source and we want that particular source to become part of the output. Here is a simple sample of what I am talking about:
Create table #Email
(
pk_id int not null identity primary key,
email_address varchar(50),
optin_flag bit default(0),
unsub_flag bit default(0)
)
Create table #History
(
pk_id int not null identity primary key,
email_id int not null,
Status_Cd char(2),
Status_Ds varchar(20),
Source_Cd char(3),
Source_Desc varchar(20),
Source_Dttm datetime
)
Insert into #Email
Values
('test@test.com',1,0),
('blank@blank.com',1,1)
Insert into #History
values
(1,'OP','OPT-IN','WB','WEB','1/2/2015 09:32:00'),
(1,'OP','OPT-IN','WB','WEB','1/3/2015 10:15:00'),
(1,'OP','OPT-IN','WB','WEB','1/4/2015 8:02:00'),
(2,'OP','OPT-IN','WB','WEB','2/1/2015 07:22:00'),
(2,'US','UNSUBSCRIBE','EM','EMAIL','3/2/2015 09:32:00'),
(2,'US','UNSUBSCRIBE','ESP','SERVICE PROVIDER','3/2/2015 09:55:00'),
(2,'US','UNSUBSCRIBE','WB','WEB','3/2/2015 10:15:00')
;with dates as
(
select
email_id,
[OP] as [OptIn_Dttm],
[US] as [Unsub_Dttm]
from
(
select
email_id,
status_cd,
source_dttm
from #history
) as src
pivot (min(source_dttm) for status_cd in ([OP],[US])) as piv
)
select
e.pk_id as email_id,
e.email_address,
e.optin_flag,
/*WANT TO GET THE OPTIN SOURCE HERE*/ /*<-------------*/
d.OptIn_Dttm,
e.unsub_flag,
d.Unsub_Dttm
/*WANT TO GET THE UNSUB SOURCE HERE*/ /*<-------------*/
from #Email e
left join dates d on e.pk_id = d.email_id
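For reference, the OVER PARTITION approach I'm trying to avoid would look something like the sketch below (only the two statuses from the sample are shown; with all of my real statuses it grows into those 8 or 9 LEFT JOINs):
-- Sketch: latest row per email/status via ROW_NUMBER(), carrying the source columns along
;with latest as
(
    select
        email_id,
        status_cd,
        source_desc,
        source_dttm,
        row_number() over (partition by email_id, status_cd
                           order by source_dttm desc) as rn
    from #history
)
select
    e.pk_id as email_id,
    e.email_address,
    e.optin_flag,
    op.source_desc as OptIn_Source,
    op.source_dttm as OptIn_Dttm,
    e.unsub_flag,
    us.source_dttm as Unsub_Dttm,
    us.source_desc as Unsub_Source
from #Email e
left join latest op on e.pk_id = op.email_id and op.status_cd = 'OP' and op.rn = 1
left join latest us on e.pk_id = us.email_id and us.status_cd = 'US' and us.rn = 1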