Spark SQL get lineage of parent tables - apache-spark

I have created 3 temporary tables by loading from case classes. Let's call them
ABC
BCD
EFG
Then I proceed to create 3 more tables by performing joins:
ABC joined with BCD gives XYZ
XYZ joined with EFG gives LMN
LMN joined with ABC gives PQR
Does Spark allow me to see, in some way, the lineage of dependent registered temporary tables? How can I extract the information that LMN depends on XYZ and PQR depends on LMN, and use it programmatically to build a lineage tree (without all the plan information)?
e.g.:
|-PQR
  |-LMN
    |-XYZ
      |-ABC
    |-EFG
  |-ABC
Thanks.

There is an open source tool you can use to visualize the lineage: https://github.com/AbsaOSS/spline
It harvests the lineage at runtime and captures it for display as a graph.
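If you want to build the tree programmatically yourself, one rough option is to inspect which relations each view's defining query references. The snippet below is only a minimal sketch using Catalyst's internal parser, not a documented lineage API; the helper name and overall approach are my own assumptions. It returns the direct parents of a single query, and applying it recursively to the SQL that defined each parent view would give a tree like the one in your example.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical helper: parse a SQL statement (e.g. the SELECT that defined a view)
// and collect every table/view it references. These are the view's direct parents.
def directParents(sql: String): Seq[String] =
  spark.sessionState.sqlParser.parsePlan(sql).collect {
    case r: UnresolvedRelation => r.tableName
  }

// Example: the query that defined LMN references XYZ and EFG.
directParents(
  "SELECT xyz.xyz_column1, efg.efg_column2 FROM xyz LEFT JOIN efg ON xyz.xyz_column1 = efg.efg_column1"
)
// => Seq("xyz", "efg")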

Let's take this SQL as an example, which creates exactly the tables/views you mentioned:
CREATE TABLE ABC(abc_column1 int, abc_column2 string);
CREATE TABLE BCD(bcd_column1 int, bcd_column2 string);
CREATE TABLE EFG(efg_column1 int, efg_column2 string);
create view XYZ(xyz_column1,xyz_column2)
as select abc.abc_column1, bcd.bcd_column2
from abc
left join bcd on abc.abc_column1 = bcd.bcd_column1;
create view LMN (lmn_column1,lmn_column2)
as select xyz.xyz_column1, efg.efg_column2
from xyz
left join efg on xyz.xyz_column1 = efg.efg_column1;
create view PQR (pqr_column1, pqr_column2)
as select lmn.lmn_column1, abc.abc_column2
from LMN
left join ABC on abc.abc_column1 = lmn.lmn_column1;
Running this SQL through a lineage tool produces the data lineage you are asking about.
You can try your own SQL and get its data lineage here:
https://sqlflow.gudusoft.com/#/

Related

How to join two tables, apply where on both tables, and apply pagination with Bookshelf (Node.js)

I have two tables mentioned below:
Reports
Id | status
-----------
1  | Active

Reports_details
Id | Cntry | State | City
-------------------------
1  | IN    | UP    | Delhi
1  | US    | Texas | Salt lake
Now my requirement is
Select distinct r.Id from Reports r left join Reports_details rd on r.Id = rd.Id where r.status = 'Active' and contains(city, '"Del*"')
Note: using contains for full text search
Problem: How do I add a where clause on both tables' Bookshelf Models simultaneously, and how do I fetch the above query's data with pagination?
I tried creating 2 respective Models with belongsTo and hasMany, but the issue comes when applying where on either Model: it does not accept where clauses from both tables (error: Invalid column name).
Appreciate your suggestions on a workaround. Thank you.

Salting Technique to tackle Skew in Spark SQL

I am trying to understand the salting technique for tackling skew in Spark SQL. I have done some reading online and have come up with a very rudimentary implementation of it using the Spark SQL API.
Let's assume that table1 is skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more often than the other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
cid, item, concat(cid, '-', floor(rand() * 20)) as salted_key -- 20 salt values: 0..19
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.vehicle
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
Is there any rule of thumb for choosing the optimal number of splits to use? For example, if table1 has 10 million records, how many bins/buckets should I use? (In this simple test example I have used 20.)
As shown above, when I am creating table2_salt, I am hardcoding the salts (0, 1, 2, 3 ... through 19). Is there a better way to implement the same functionality without the hardcoding and the clutter? (What if I want to use 100 splits!) A sketch addressing this follows the questions.
Since we are replicating the second table (table2) N times, doesn't that mean it will degrade the join performance?
Note: I need to use the Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.
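On question 2 specifically, one way to avoid hardcoding the salt list is to generate it. The sketch below is only an illustration under my own assumptions (the numSalts variable name is mine; the view and column names are copied from the question): it uses the sequence() SQL function available from Spark 2.4, so switching to 100 splits becomes a one-variable change.
// Hypothetical sketch: parameterise the number of salts instead of hardcoding 0..19.
// sequence(start, stop) is available in Spark SQL from version 2.4.
val numSalts = 20 // bump to 100 without touching the SQL below

spark.sql(s"""
  CREATE OR REPLACE TEMPORARY VIEW table1_salt AS
  SELECT cid, item, concat(cid, '-', floor(rand() * $numSalts)) AS salted_key
  FROM table1
""")

spark.sql(s"""
  CREATE OR REPLACE TEMPORARY VIEW table2_salt AS
  SELECT cid, vehicle, explode(sequence(0, ${numSalts - 1})) AS salted_key
  FROM table2
""")
// The final join stays exactly as in the question.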

Cross Join for calculation in Spark SQL

I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue.
Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios?
Reference table: (contains only 1 value)
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
My question is - Is there a better approach to implement this requirement ?
Thanks
Simply use withColumn!
df.withColumn("new_col", to_date(lit("10-05-2020"), "dd-MM-yyyy"))
Inside the view you are using a constant value, so you can simply put the same value in the query below without the cross join.
select
a.custid,
a.birthdt,
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;
scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+
scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+
It's hard to work out exactly what your point is, but if you cannot use Scala or pyspark and DataFrames with .cache etc., then instead of using a temporary view, just create a single-row table. My impression is that you are using Spark %sql in a notebook on, say, Databricks; that is my suspicion, as it were.
That said, a broadcast join hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
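As a concrete (but assumption-laden) illustration of that hint, here is what it could look like against the views from the question; broadcasting the single-row ref view turns the cross join into a broadcast nested-loop join instead of shuffling the large cust table.
// Sketch only: the same cross join with an explicit broadcast hint on the one-row view.
val withAge = spark.sql("""
  SELECT /*+ BROADCAST(b) */
         a.custid,
         a.birthdt,
         CAST(datediff(b.refdt, a.birthdt) / 365.25 AS INT) AS age
  FROM cust a
  CROSS JOIN ref b
""")
withAge.show(false)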

How do I select only a specific column from a Dataset after sorting it?

I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I want to sort the table based on the count column and see only the count column. I have done it, but I am doing it in 2 steps:
1- I first sort to get a sorted DS: dataDS.sort(col("count").desc)
2- then select on that DS: (dataDS.sort(col("count").desc)).select(col("count")).show();
The above feels like an embedded SQL query to me. In SQL, however, I can do the same query without using an embedded query:
select * from flight_data_2015 ORDER BY count ASC
Is there a better way for me to both sort and select without creating a new Dataset?
There is nothing wrong with:
(dataDS.sort(col("count").desc)).select(col("count")).show();
It is the right thing to do and has no negative performance implications, other than the intrinsic cost of sorting as such.
Use it freely and don't worry about it anymore.
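If it is the nesting that bothers you, the same two operations can also be written as one method chain; this is purely a stylistic rewrite of the code above, assuming the same dataDS and imports from the question.
import org.apache.spark.sql.functions.col

// Same result as the nested form: sort by count descending, then keep only count.
dataDS.sort(col("count").desc)
  .select(col("count"))
  .show()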

How can I merge 2 Spotfire tables by a regex match?

I am working on a Spotfire tool, and I am using a calculated column in my main data table to group data rows into 'families' through a regex match. For example, one row might have a 'name' of ABC1234xyz, so it would be part of the ABC family because it contains the string 'ABC'. Another row could be something like AQRST31x2af and belong to the QRST family. The main point is that the 'family' is decided by matching a substring in the name, but that substring could be any length and isn't necessarily at the beginning of the name string.
Right now I am doing this by a large nested If statement with a calculated column. However, this is tedious for adding new families, and maintaining the current list of families. What I would like to do is create a table with 2 columns, the string match and the family name. Then, I would like to match from this table to determine family instead of the nested if. So, it might look like the below tables:
Match Table:
id_string | family
----------------------
ABC | ABC
QRST | QRST
SUP | Super
Main Data Table:
name | data | family
---------------------------------------
ABC1234 | 1.02342 | ABC
ABC1215 | 1.23749 | ABC
AQRST31x2af | 1.04231 | QRST
BQRST32x2ac | 1.12312 | QRST
1903xSUP | 1.51231 | Super
1204xSUP | 1.68123 | Super
If you have any suggestions, I would appreciate it.
Thanks.
@wcase6 - As far as I know, you cannot add columns from one table to another based on an expression. When you add columns, the value in one matching column should exactly match the other.
Instead, you can try the below solution on your 'Main Data Table'.
Note: This solution is based on the scenarios posted. If there are more/different scenarios, you might have to tweak the custom expressions provided.
Step 1: Add a calculated column 'ID_string' which strips out lower-case letters and digits:
Trim(RXReplace([Name],"[a-z0-9]","","g"))
Step 2: Add a calculated column 'family'.
If([ID_string]="SUP","Super",If(Len([ID_string])>3,right([ID_string],4),[ID_string]))
Final Output:
Hope this helps!
As @ksp585 mentioned, it doesn't seem like Spotfire can do exactly what I want, so I have come up with a solution using IronPython. Essentially, here is what I have done:
Created a table called FAMILIES, with the columns IDString and Family, which looks like this (using the same example strings above):
IDString | Family
------------------------
ABC | ABC
SUP | Super
QRST | QRST
Created a table called NAMES, as a pivot off of my main data table, with the only column being NAME. This just creates a list of unique names (since the data table has many rows for each name):
NAME
------------------------
ABC1234
ABC1215
AQRST31x2af
BQRST32x2ac
...
Created a Text Area with a button labeled Match Families, which calls an IronPython script. That script reads the NAMES table, and the FAMILIES table, compares each name to the IDString column with a regex, and associates each name with a family from the results. Any names that don't match a single IDString get the family name 'Other'. Then, it generates a new table called NAME_FAMILY_MAP, with the columns NAME and FAMILY.
With this new table, I can then add a column back to the original data table using a left outer join from NAME_FAMILY_MAP, matching on NAME. Because NAME_FAMILY_MAP is not directly linked to the NAMES table (generated by a button press), it does not create a circular dependency.
I can then add families to the FAMILIES table using another script, or by just replacing the FAMILIES table with an updated list. It's slightly more tedious than what I was hoping, but it works, so I'm happy.
