How to do compare/subtract records - presto

Table A having 20 records and table B showing 19 records. How to find that one record is which is missing in table B. How to do compare/subtract records of these two tables; to find that one record. Running query in Apache Superset.

The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.col1 = a.col1 AND
b.col2 = a.col2 AND
b.col3 = a.col3)

Related

How to avoid key column name duplication in join?

I'm trying to join two tables in spark sql. Each table has 50+ columns. Both has column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type all the other column names in the query above. (other than id column, no other duplicated column names).
what should I do? thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
Assuming, you want to preserve the "duplicates", you can try to use the internal row-id or equivalents for your help. This helped me in the past, if I had to delete exactly one of two identical rows.
select *,ctid from table;
outputs in postgresql also the internal counter id. Your before exact identical rows become different now. I don't know about spark.sql, but I assume, that you can access a similar attribute there.
val joined = spark
.sql("select * from tbl1")
.join(
spark.sql("select * from tbl2"),
Seq("id"),
"inner" // optional
)
joined should have only one id column. Tested with Spark 2.4.8

How to preserve partition after joining two tables in Athena?

I have two Athena tables 1 and 2. Table 1 is partitioned, table 2 is not. When I create table 3 from the result of joining 1 and 2 on a mutual field, the partition in table 1 isn't propagated.
I know it's possible to do CTAS queries with partitions, but that requires the partition to be an existing column.
Is there a way to keep the partition in table 1 when creating table 3, something like this:
CREATE TABLE table_3
WITH (
format='PARQUET',
partitioned_by='existing_partition_in_table_1'
) AS
SELECT table_1.field
FROM table_1
JOIN table_2
ON table_1.field = table_2.field
Figured it out five minutes later.. I just need to select the partition from table 1 as well, then the CTA statement can access the partition
CREATE TABLE table_3
WITH (
format='PARQUET',
partitioned_by='partition_name'
) AS
SELECT table_1.field, table_1.partition_name
FROM table_1
JOIN table_2
ON table_1.field = table_2.field
*facepalm

Left join in subquery

I am trying to do a left join on my main table using this code
select distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
VBen.BENF_COD_SEXE AS Sexe,
VBen.BENF_DAT_NAISS AS DatNaiss,
VBen.BENF_DAT_DECES AS DatDec,
A.date_ch as date_chsld
from PROD.V_FICH_ID_BEN_CM AS VBen
left join (select distinct VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
vass.BENF_DD_ADMIS_ASSU_MED as date_ch
from Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM as vass ) as A
on VBen.BENF_NO_INDIV_BEN_BANLS =A. benbanls
where Vben.BENF_DAT_NAISS>'2016-04-01' or Vben.BENF_DAT_DECES>'2011-04-01'
The problem is that the query result is a table with of number of rows greater than the main table with the same where 'condition'. I don't understand what I am missing
Thanks for your help
Why is it a problem?
The results simply indicate you have a 1:M (one to many) relationship between VBen:Vass(A)
If you don't have a 1:M relationship and it should be 1:1 then...
you're missing join criteria between the tables.
you should be getting a min/max on your date instead of all dates per benbanls
To better understand and answer we would need to know what VBen and Vass actually represent; but to put simply, you have multiple VASS(A) per VBEN
To illustrate with an example: Think about Order_Header and Order_Line tables...
Order_header contains (order_Number PK)
Order_line contains (Order_Number, Order_Line PK)
An order can have multiple lines, each line could have it's own ship date several items may have gone out on the same shipment/day. where some that were backordered went out on a different day. In this situation, an order would still have multiple lines even though we distinct order_number and shipmentdate in a subquery. I would guess your situation is similar.
so 1 in base table * 2 rows in derived/lines table gives us 2 records
1 < 2 which is the situation you have now; and that to me is perfectly fine and expected if it's a 1:M relationship.
Maybe you need to do a min or max on date instead of a distinct?
If not you're missing join criteria to make a 1:1 relationship
maybe your expectation is just flawed.
The below will give you a 1:1 relationship but I'm not sure it's what you're after.
SELECT distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
VBen.BENF_COD_SEXE AS Sexe,
VBen.BENF_DAT_NAISS AS DatNaiss,
VBen.BENF_DAT_DECES AS DatDec,
A.date_ch as date_chsld
FROM PROD.V_FICH_ID_BEN_CM AS VBen
LEFT JOIN (SELECT VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
Max(vass.BENF_DD_ADMIS_ASSU_MED) as date_ch
FROM Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM as vass
GROUP BY VAss.BENF_NO_INDIV_BEN_BANLS) as A
on VBen.BENF_NO_INDIV_BEN_BANLS = A. benbanls
WHERE (Vben.BENF_DAT_NAISS>'2016-04-01'
or Vben.BENF_DAT_DECES>'2011-04-01)
It is likely that there is more than one counterpart in the detail table of a record on the main table.
I try your scenario on my db get a correct result.
In my DB:
select distinct p.PollId as PollId,
p.Title AS Title,
p.InsertDate AS DatDec,
ps.date_ch as date_chsld
from dbo.Poll AS p
left join (select distinct pSt.PollId as pollId,
Max(pSt.InsertDate) as date_ch
from dbo.PollStore as pSt
Group by pSt.PollId ) as ps
on p.PollId =ps.pollId
As Your Query like this :
select distinct VBen.BENF_NO_INDIV_BEN_BANLS as benbanls,
VBen.BENF_COD_SEXE AS Sexe,
VBen.BENF_DAT_NAISS AS DatNaiss,
VBen.BENF_DAT_DECES AS DatDec,
A.date_ch as date_chsld
please try this query
from PROD.V_FICH_ID_BEN_CM AS VBen
left join (select distinct VAss.BENF_NO_INDIV_BEN_BANLS as benbanls,
Max(vass.BENF_DD_ADMIS_ASSU_MED) as date_ch
from Prod.V_ADMIS_ASSU_MED_PLAN_PRIOR_CM Group by VAss.BENF_NO_INDIV_BEN_BANLS as vass ) as A
on VBen.BENF_NO_INDIV_BEN_BANLS =A. benbanls
where Vben.BENF_DAT_NAISS>'2016-04-01' or Vben.BENF_DAT_DECES>'2011-04-01'

Doctrine2 subquery

I'm trying to write a subquery in doctrine2, to sort a table ordered by date column of another table.
(let's say I'm querying table A, which has an id column, and B has an a_id and a date, in the subquery b.a_id = a.id)
I'm using query builder, and the addSelect method, but since I can't use LIMIT in my query I get this error:
SQLSTATE[21000]: Cardinality violation: 1242 Subquery returns more
than 1 row
This error is true, but how could I limit this query to 1 row, if LIMIT is not allowed in doctrine2 and when I try to do it by querybuilder (I mean the subquery) and I'm using setMaxResults, and then getDQl it is still not working.
->addSelect('(SELECT b.date FROM B b WHERE b.conversation = a.id ORDER BY b.date DESC)')
Is there any solution for my problem?
Thanks
Make the query return exactly one row. Try SELECT MAX(b.date) FROM B b WHERE b.conversation = a.id)

how to join two or more tables and result set having all distinct values

I have some 20 excel files containing data. all the tables have same columns like id name age location etc..... each file has distinct data but i don't know if data in one file is again repeated in another file. so i want to join all the files and the result st should contain distinct values. please help me out with this problem as soon as possible. i want the result set to be stored in an access database.
I would recomend either linking the sheets in acces, or importing the sheets as tabels.
Then from there try to determine using a DISTINCT select from the tables/sheets the keys required, and only selecting the records as required.
In SQL, you can use JOIN or NATURAL JOIN to join tables. I would look into NATURAL JOIN since you said all tables have the same values.
After that you can use DISTINCT to get distinct values.
I'm not sure if this is what you're looking for though: your question asks about excel but you've tagged it with SQL.
If you can use all the tables in one query, you can use a union to get the distinct rows:
select id, name, age, location from Table1
union
select id, name, age, location from Table2
union
select id, name, age, location from Table3
union
...
You can insert the records directly from the result:
insert into ResultTable
select id, name, age, location from Table1
union
....
If you only can select from one table at a time, you can skip the insert of rows that are already in the table:
insert into ResultTable
select t.id, t.name, t.age, t.location from Table1 as t
left join ResultTable as r on r.id = t.id
where r.id is null
(Assuming that id is a unique field identifying the record.)
It seems the unique set of data you want is this:
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
...but that you then want to arbitrarily apply a sequence of integers as id (rather than using the id values from the Excel tables).
Because Access Database Engine does not support common table expressions and Excel does not support VIEWs, you will have to repeat that UNION query as derived tables (hopefully the optimizer will recognize the repeat?) e.g. using a correlated subquery to get the row number:
SELECT (
SELECT COUNT(*) + 1
FROM (
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
) AS DT1
WHERE DT1.name < DT2.name
) AS id,
DT2.name, DT2.loc
FROM (
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T2
UNION
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T2
) AS DT2;
Note:
i want the result set to be stored in
an access database
Then maybe you should migrate the Excel data into a staging table in your Access database and do the data scrubbing from there. At least you could put that derived table into a VIEW :)
Join is to combine two tables by matching the values in corresponding columns. In result, you will get a merged table which consists of the first table, plus the matched rows copied from the second table. You can use DIGBD add-in for excel

Resources