How to avoid key column name duplication in join?

How to avoid key column name duplication in join? - apache-spark

I'm trying to join two tables in spark sql. Each table has 50+ columns. Both has column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has duplicated id column.
We can of course specify which id column to keep like below:
spark.sql("select tbl1.id, .....from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type all the other column names in the query above. (other than id column, no other duplicated column names).
what should I do? thanks.

If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.

Assuming, you want to preserve the "duplicates", you can try to use the internal row-id or equivalents for your help. This helped me in the past, if I had to delete exactly one of two identical rows.
select *,ctid from table;
outputs in postgresql also the internal counter id. Your before exact identical rows become different now. I don't know about spark.sql, but I assume, that you can access a similar attribute there.

val joined = spark
.sql("select * from tbl1")
.join(
spark.sql("select * from tbl2"),
Seq("id"),
"inner" // optional
)
joined should have only one id column. Tested with Spark 2.4.8

Related

Is there a way to limit the data being read when joining tables in spark sql?

I want to read the data in spark sql by joining two very large tables. But i just need a fix number (let's say 500) from the resultant dataframe.
For example -
SELECT id, name, employee.deptno, deptname
FROM employee INNER JOIN department ON employee.deptno = department.deptno
Here i can use head(500) or limit(500) function on the resultant dataframe to limit the rows from resultant dataframe, But still it is going to read full data from both of the tables first and then on the resultant dataframe it will apply the limit.
Is there a way in which i can avoid reading full data before applying the limit ?

Something like this:
employee = spark.sql('select id, name, deptno from employee limit 500')
department = spark.sql('select deptno, deptname from department limit 500')
employee = employee.join(department, on = 'deptno', how = 'inner')

How to do compare/subtract records

Table A having 20 records and table B showing 19 records. How to find that one record is which is missing in table B. How to do compare/subtract records of these two tables; to find that one record. Running query in Apache Superset.

The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.col1 = a.col1 AND
b.col2 = a.col2 AND
b.col3 = a.col3)

Hive query for row number

I am working on pyspark, need to write a query which reads data from hive table and returns a pyspark dataframe containing all the columns and row number.
This is what I tried :
SELECT *, ROW_NUMBER() OVER () as rcd_num FROM schema_name.table_name
This query works fine in hive, but when I run it from a pyspark script it throws the following error:
Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
Please suggest some solution.
Note: I do not wish to order the rows in any particular order, I just need row numbers for all the rows present in the table without any sorting or ordering.
Using spark 2.1

ROW_NUMBER()might be required ordering so you can used monotonicallyIncreasingId function which gives you row numbers for all the rows present in the table.
from pyspark.sql.functions import monotonicallyIncreasingId
df.withColumn("rcd_num ", monotonicallyIncreasingId())
OR
SELECT *, ROW_NUMBER() OVER (Order by (select NULL)) as rcd_num FROM schema_name.table_name
you can set order by select NULL

Spark sql dataframe drop all columns from alias table after join

I am trying to join two dataframes with the same column names and compute some new values. after that i need to drop all columns of second table. The number of columns is huge. How can I do it in easier way? I tried to .drop("table2.*"),but this dont work.

You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
reference with the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)

Instead of dropping, you can select all the necessary columns that you want hold for further operations something like below
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")

how to join two or more tables and result set having all distinct values

I have some 20 excel files containing data. all the tables have same columns like id name age location etc..... each file has distinct data but i don't know if data in one file is again repeated in another file. so i want to join all the files and the result st should contain distinct values. please help me out with this problem as soon as possible. i want the result set to be stored in an access database.

I would recomend either linking the sheets in acces, or importing the sheets as tabels.
Then from there try to determine using a DISTINCT select from the tables/sheets the keys required, and only selecting the records as required.

In SQL, you can use JOIN or NATURAL JOIN to join tables. I would look into NATURAL JOIN since you said all tables have the same values.
After that you can use DISTINCT to get distinct values.
I'm not sure if this is what you're looking for though: your question asks about excel but you've tagged it with SQL.

If you can use all the tables in one query, you can use a union to get the distinct rows:
select id, name, age, location from Table1
union
select id, name, age, location from Table2
union
select id, name, age, location from Table3
union
...
You can insert the records directly from the result:
insert into ResultTable
select id, name, age, location from Table1
union
....
If you only can select from one table at a time, you can skip the insert of rows that are already in the table:
insert into ResultTable
select t.id, t.name, t.age, t.location from Table1 as t
left join ResultTable as r on r.id = t.id
where r.id is null
(Assuming that id is a unique field identifying the record.)

It seems the unique set of data you want is this:
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
...but that you then want to arbitrarily apply a sequence of integers as id (rather than using the id values from the Excel tables).
Because Access Database Engine does not support common table expressions and Excel does not support VIEWs, you will have to repeat that UNION query as derived tables (hopefully the optimizer will recognize the repeat?) e.g. using a correlated subquery to get the row number:
SELECT (
SELECT COUNT(*) + 1
FROM (
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
) AS DT1
WHERE DT1.name < DT2.name
) AS id,
DT2.name, DT2.loc
FROM (
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T2
UNION
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T2
) AS DT2;
Note:
i want the result set to be stored in
an access database
Then maybe you should migrate the Excel data into a staging table in your Access database and do the data scrubbing from there. At least you could put that derived table into a VIEW :)

Join is to combine two tables by matching the values in corresponding columns. In result, you will get a merged table which consists of the first table, plus the matched rows copied from the second table. You can use DIGBD add-in for excel

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to avoid key column name duplication in join? - apache-spark

If id is the only column name in common, you can take advantage of the USING clause: spark.sql("select * from tbl1 join tbl2 using (id) ") The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.

val joined = spark .sql("select * from tbl1") .join( spark.sql("select * from tbl2"), Seq("id"), "inner" // optional ) joined should have only one id column. Tested with Spark 2.4.8

Related

Is there a way to limit the data being read when joining tables in spark sql?

How to do compare/subtract records

Hive query for row number

Spark sql dataframe drop all columns from alias table after join

how to join two or more tables and result set having all distinct values

Categories

Resources