presto: convert array to rows? - pivot

i have a table with array columns all_available_tags and used_tags.
example row1:
all_available_tags:A,B,C,D
used_tags:A,B
example row2:
all_available_tags:B,C,D,E,F
used_tags:F
I want to get distinct set of all_available_tags from all rows and do except the set with all used_tags from all rows. from example above, all_available_tags of all rows would be A,B,C,D,E,F and all used_tags would be A,B,F. the end result i am looking for is C,D,E
I think i need to somehow pivot the table but there could be 100s of different tags, so it is not practical to list out everyone of them. is there a good way to do this?

You can try:
with tags(at, ut) as
(
select "A,B,C,D", "A,B"
union all
select "B,C,D,E,F", "F"
)
select splitat
from tags
cross join unnest(split(at, ",")) as t1 splitat
except
select splitut
from tags
cross join unnest(split(ut, ",")) as t2 splitut

Related

Selecting a column not in cube in Spark

I have a dataframe which has say 3 columns x,y and z.
I want to get all the three columns in result but I do not want to cube on column z.
Is there a way I can do it?
P.S. - (I have just given example with 3 columns but I have quite a long list of columns so GROUP SET is not an option).
Example -
val df = Seq(("1","x","a"),("1","v","b"),("3","x","c")).toDF("col1","col2","col3")
val list = Seq("col1","col2").map(e=>col(e))
// now I want to select col3 non cubed (basically I do not want get the combinations for it)
// This guy will not select col3 at all since col3 is not part of cube which is I want to achieve
display(df.select($"col1",$"col2",$"col3").cube(list:_*).agg(sum("col1")))
Cube is an extension of GroupBY in which you will get the aggregated result for the various combinations of columns used to group by.
Here is an example of what you can achieve using groupBy,
df.cube($"col1",$"col2").agg(first($"col3").as("col3")).show
Please share your expected result as suggested by Shaido.

Join two DataFrames where the join key is different and only select some columns

What I would like to do is:
Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B
I tried something like what I put below with different quotation marks but still not working. I feel in pyspark, there should have a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this but I would like to do it more like the pseudo code above.
Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames:
A_B = A.join(B, on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join():
If on is a string or a list of strings indicating the name of the join
column(s), the column(s) must exist on both sides, and this performs
an equi-join.
Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to created a column with the same name in both DataFrames:
A_B = A.withColumn("id", col("a_id")).join(B.withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias() to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
Try this solution:
A_B = A.join(B,col('B.id') == col('A.id')).select([col('A.'+xx) for xx in A.columns]
+ [col('B.other1'),col('B.other2')])
The below lines in SELECT played the trick of selecting all columns from A and 2 columns from Table B.
[col('a.'+xx) for xx in a.columns] : all columns in a
[col('b.other1'),col('b.other2')] : some columns of b
I think the easier solution is just to join table A to table B with selected columns you want. here is a sample code to do this:
joined_tables = table_A.join(table_B.select('col1', 'col2', 'col3'), ['id'])
the code above join all columns from table_A and columns "col1", "col2", "col3" from table_B.

Break ties in RANKX Powerpivot formula

I can rank my data with this formula, which groups by Year, Trust and ID, and ranks the Areas.
rankx(
filter(Table,
[Year]=earlier([Year])&&[Trust]=earlier([Trust])&&[ID]=earlier([ID])),
[Area], ,1,Dense)
This works fine - unless you have data where the same Area appears more than once in the same group, whereupon it gives all rows the rank of 1. Is there any way to force unique rank values? So two rows that have the same Area would be given the rank of 1 and 2 (in an arbitrary order)? Thank you for your time.
Assuming you don't have duplicate rows in your table, you can add another column as a tie-breaker in your expression.
Suppose your table has an additional column, [Name], that is distinct between your multiple [Area] rows. Then you could write your formula like this:
= RANKX(
FILTER(Table,
[Year] = EARLIER([Year]) &&
[Trust] = EARLIER([Trust]) &&
[ID] = EARLIER([ID])),
[Area] & [Name], , 1, Dense)
You can append as many columns as you need to get the tie-breaking done.

Cross reference matrix in Excel

I have a two column list:
SystemA, tableA
SystemA, tableB
SystemA, tableC
SystemA, tableD
SystemB, tableA
SystemB, tableC
SystemB, tableD
SystemC, tableA
I need to generate a cross reference matrix listing tables (no dups) and which systems reference them.
Here is what it should look like:
SystemA SystemB SystemC
tableA x x x
tableB x
tableC x x
tableD x x
Is this something that can be done in Excel or do I have to write code to do it?
You need a simple pivot table. Make some search on google.
Following wizard is very simple. You have just to drag fields in the right place (table on row field and system on both column field and data field).

how to join two or more tables and result set having all distinct values

I have some 20 excel files containing data. all the tables have same columns like id name age location etc..... each file has distinct data but i don't know if data in one file is again repeated in another file. so i want to join all the files and the result st should contain distinct values. please help me out with this problem as soon as possible. i want the result set to be stored in an access database.
I would recomend either linking the sheets in acces, or importing the sheets as tabels.
Then from there try to determine using a DISTINCT select from the tables/sheets the keys required, and only selecting the records as required.
In SQL, you can use JOIN or NATURAL JOIN to join tables. I would look into NATURAL JOIN since you said all tables have the same values.
After that you can use DISTINCT to get distinct values.
I'm not sure if this is what you're looking for though: your question asks about excel but you've tagged it with SQL.
If you can use all the tables in one query, you can use a union to get the distinct rows:
select id, name, age, location from Table1
union
select id, name, age, location from Table2
union
select id, name, age, location from Table3
union
...
You can insert the records directly from the result:
insert into ResultTable
select id, name, age, location from Table1
union
....
If you only can select from one table at a time, you can skip the insert of rows that are already in the table:
insert into ResultTable
select t.id, t.name, t.age, t.location from Table1 as t
left join ResultTable as r on r.id = t.id
where r.id is null
(Assuming that id is a unique field identifying the record.)
It seems the unique set of data you want is this:
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
...but that you then want to arbitrarily apply a sequence of integers as id (rather than using the id values from the Excel tables).
Because Access Database Engine does not support common table expressions and Excel does not support VIEWs, you will have to repeat that UNION query as derived tables (hopefully the optimizer will recognize the repeat?) e.g. using a correlated subquery to get the row number:
SELECT (
SELECT COUNT(*) + 1
FROM (
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
) AS DT1
WHERE DT1.name < DT2.name
) AS id,
DT2.name, DT2.loc
FROM (
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T2
UNION
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T2
) AS DT2;
Note:
i want the result set to be stored in
an access database
Then maybe you should migrate the Excel data into a staging table in your Access database and do the data scrubbing from there. At least you could put that derived table into a VIEW :)
Join is to combine two tables by matching the values in corresponding columns. In result, you will get a merged table which consists of the first table, plus the matched rows copied from the second table. You can use DIGBD add-in for excel

Resources