Cross reference matrix in Excel - excel

I have a two column list:
SystemA, tableA
SystemA, tableB
SystemA, tableC
SystemA, tableD
SystemB, tableA
SystemB, tableC
SystemB, tableD
SystemC, tableA
I need to generate a cross reference matrix listing tables (no dups) and which systems reference them.
Here is what it should look like:
        SystemA  SystemB  SystemC
tableA  x        x        x
tableB  x
tableC  x        x
tableD  x        x
Is this something that can be done in Excel or do I have to write code to do it?

You need a simple pivot table; a quick search on Google will turn up tutorials.
The wizard is very simple to follow. You just have to drag the fields to the right place (table on the row field, and system on both the column field and the data field).
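If you would rather write code than build a pivot table, the same matrix can be sketched in a few lines of Python. This is only an illustration under assumptions: the pairs are typed in by hand from the question, and the function names cross_reference and render are made up for this example.

```python
from collections import defaultdict

# Sample (system, table) pairs copied from the question.
pairs = [
    ("SystemA", "tableA"), ("SystemA", "tableB"), ("SystemA", "tableC"),
    ("SystemA", "tableD"), ("SystemB", "tableA"), ("SystemB", "tableC"),
    ("SystemB", "tableD"), ("SystemC", "tableA"),
]


def cross_reference(pairs):
    """Return (sorted systems, {table: set of systems referencing it})."""
    systems = sorted({system for system, _ in pairs})
    refs = defaultdict(set)
    for system, table in pairs:
        refs[table].add(system)
    return systems, dict(refs)


def render(systems, refs):
    """Format the matrix as tab-separated text, 'x' where a system references a table."""
    lines = ["\t".join([""] + systems)]
    for table in sorted(refs):
        marks = ["x" if s in refs[table] else "" for s in systems]
        lines.append("\t".join([table] + marks))
    return "\n".join(lines)


systems, refs = cross_reference(pairs)
print(render(systems, refs))
```

The tab-separated output can be pasted straight back into a worksheet. Duplicate pairs in the input are harmless because each table keeps a set of systems.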

Related

How to do compare/subtract records

Table A has 20 records and table B shows 19 records. How can I find the one record that is missing from table B? How do I compare/subtract the records of these two tables to find that one record? I am running the query in Apache Superset.
The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.col1 = a.col1 AND
b.col2 = a.col2 AND
b.col3 = a.col3)
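To see the NOT EXISTS query in action, here is a small sketch using Python's built-in sqlite3 module. The table contents and the name column are invented for the demonstration, and SQLite stands in for whatever database Superset is connected to, so treat it as an illustration rather than a Superset recipe.

```python
import sqlite3

# In-memory database with made-up rows: TableA has one row that TableB lacks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (pk INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE TableB (pk INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO TableA VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO TableB VALUES (1, 'one'), (3, 'three');
""")

# Anti-join: keep rows of TableA that have no partner in TableB.
missing = conn.execute("""
    SELECT a.*
    FROM TableA a
    WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk)
""").fetchall()

print(missing)
```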

presto: convert array to rows?

I have a table with array columns all_available_tags and used_tags.
example row1:
all_available_tags:A,B,C,D
used_tags:A,B
example row2:
all_available_tags:B,C,D,E,F
used_tags:F
I want to get the distinct set of all_available_tags across all rows and subtract (EXCEPT) the set of all used_tags across all rows. From the example above, all_available_tags across all rows would be A,B,C,D,E,F and all used_tags would be A,B,F. The end result I am looking for is C,D,E.
I think I need to somehow pivot the table, but there could be hundreds of different tags, so it is not practical to list every one of them. Is there a good way to do this?
You can try:
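-- Sample data as comma-separated strings; Presto string literals take single quotes.
with tags(at, ut) as
(
select 'A,B,C,D', 'A,B'
union all
select 'B,C,D,E,F', 'F'
)
select splitat
from tags
-- If the columns are already arrays, unnest them directly instead of calling split().
cross join unnest(split(at, ',')) as t1(splitat)
except
select splitut
from tags
cross join unnest(split(ut, ',')) as t2(splitut)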
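The logic of that query (flatten both columns across all rows, then subtract one set from the other) can be checked outside Presto with a short Python sketch. The rows are the two examples from the question, and the helper name unused_tags is made up for illustration.

```python
# The two example rows from the question, with the arrays as Python lists.
rows = [
    {"all_available_tags": ["A", "B", "C", "D"], "used_tags": ["A", "B"]},
    {"all_available_tags": ["B", "C", "D", "E", "F"], "used_tags": ["F"]},
]


def unused_tags(rows):
    """Flatten both columns across all rows, then take the set difference."""
    available = {tag for row in rows for tag in row["all_available_tags"]}
    used = {tag for row in rows for tag in row["used_tags"]}
    return sorted(available - used)


result = unused_tags(rows)
print(result)
```

Like EXCEPT in SQL, the set difference removes duplicates, so no tag has to be listed by hand however many distinct tags exist.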

Join two DataFrames where the join key is different and only select some columns

What I would like to do is:
Join two DataFrames A and B using their respective id columns a_id and b_id. I want to select all columns from A and two specific columns from B
I tried something like the code below, with different quotation marks, but it is still not working. I feel that in pyspark there should be a simple way to do this.
A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)
I know you could write
A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")
to do this but I would like to do it more like the pseudo code above.
Your pseudocode is basically correct. This slightly modified version would work if the id column existed in both DataFrames:
A_B = A.join(B, on="id").select("A.*", "B.b1", "B.b2")
From the docs for pyspark.sql.DataFrame.join():
If on is a string or a list of strings indicating the name of the join
column(s), the column(s) must exist on both sides, and this performs
an equi-join.
Since the keys are different, you can just use withColumn() (or withColumnRenamed()) to create a column with the same name in both DataFrames:
from pyspark.sql.functions import col  # needed for col() below
A_B = A.withColumn("id", col("a_id")).join(B.withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
If your DataFrames have long complicated names, you could also use alias() to make things easier:
A_B = long_data_frame_name1.alias("A").withColumn("id", col("a_id"))\
.join(long_data_frame_name2.alias("B").withColumn("id", col("b_id")), on="id")\
.select("A.*", "B.b1", "B.b2")
Try this solution:
A_B = A.join(B,col('B.id') == col('A.id')).select([col('A.'+xx) for xx in A.columns]
+ [col('B.other1'),col('B.other2')])
The list passed to select() does the trick of selecting all columns from A and two columns from B:
[col('A.'+xx) for xx in A.columns] : all columns in A
[col('B.other1'),col('B.other2')] : some columns of B
I think the easier solution is to join table A to table B after selecting only the columns you want from B. Here is some sample code (the join key 'id' has to stay in the selection, otherwise there is nothing to join on):
joined_tables = table_A.join(table_B.select('id', 'col1', 'col2', 'col3'), ['id'])
The code above joins all columns from table_A with columns "col1", "col2", "col3" from table_B.
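For readers without a Spark session at hand, the intended result (every column of A plus two chosen columns of B, joined on differently named keys) can be illustrated in plain Python. This is not PySpark code: the rows are lists of dicts, the column b3 and the sample values are invented, and join_select is a hypothetical helper written only for this sketch.

```python
def join_select(a_rows, b_rows, a_key, b_key, b_columns):
    """Inner-join two lists of dicts on a_key == b_key.

    Keeps every column of A and only the named columns of B.
    """
    # Index B by its key so each A row finds its matches directly.
    b_index = {}
    for b in b_rows:
        b_index.setdefault(b[b_key], []).append(b)

    joined = []
    for a in a_rows:
        for b in b_index.get(a[a_key], []):
            row = dict(a)                                  # all columns from A
            row.update({c: b[c] for c in b_columns})       # selected columns from B
            joined.append(row)
    return joined


A = [{"a_id": 1, "x": "p"}, {"a_id": 2, "x": "q"}]
B = [
    {"b_id": 1, "b1": 10, "b2": 20, "b3": 30},
    {"b_id": 3, "b1": 11, "b2": 21, "b3": 31},
]

A_B = join_select(A, B, "a_id", "b_id", ["b1", "b2"])
print(A_B)
```

Only a_id 1 has a partner in B, so the result holds a single row, and neither b_id nor b3 appears in it.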

Google BigQuery nested select subquery in cross join

I have the following code:
SELECT ta.application as koekkoek, ta.ipc, ipc_count/ipc_tot as ipc_share, t3.sfields FROM (
select t1.appln_id as application, t1.ipc_subclass_symbol as ipc, count(t2.appln_id) as ipc_count, sum(ipc_count) over (PARTITION BY application) as ipc_tot
FROM temp.tls209_small t1
CROSS JOIN
(SELECT appln_id, FROM temp.tls209_small group by appln_id ) t2
where t1.appln_id = t2.appln_id
GROUP BY application, ipc
) as ta
CROSS JOIN thesis.ifris_ipc_concordance t3
WHERE ta.ipc LIKE t3.ipc+'%'
AND ta.ipc NOT LIKE t3.not_ipc+'%'
AND t3.not_appln_id NOT IN
(SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application)
Giving the following error:
Field 'ta.application' not found.
I have tried numerous notations for the field, but BigQuery doesn't seem to recognize any reference to other tables in the subquery.
The purpose of the code is to assign new technology classifications to records based on a concordance table:
I have got two tables:
One large table with application id's, classifications and some other stuff tls209_small:
And a concordance table with some exception rules ifris_ipc_concordance:
In the end I need to assign the sfields label for each row in tls209 (300 million rows). The rules are that ipc_class_symbol from the first table should be like ipc+'%' in the second table, but not like not_ipc+'%'.
In addition, the not_appln_id value, if present, should not be associated with the same appln_id in the first table.
So a small example, say this is the input of the query:
appln_id | ipc_class_symbol
1 | A1
1 | A2
1 | A3
1 | C3
sfields | ipc | not_ipc | not_appln_id
X | A | A2 | null
Y | A | null | A3
appln_id 1 should get sfields X twice, because A1 and A3 match ipc = A and do not match not_ipc = A2.
Y should not be assigned at all as A3 occurs in appln_id 1.
In the results, I also need the share of the ipc_class_symbol for a single application (1 for 328100001, 0.5 for 32100009 etc.)
Without the last condition (AND t3.not_appln_id NOT IN (SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application) ) the query works fine:
Any suggestions on how to get the subquery to recognize the application id (ta.application), or other ways to introduce the last condition to the query?
I realize my explanation of the problem may not be very straightforward, so if anything is not clear please indicate so, I'll try to clarify the issues.
The query you're performing is doing an anti-join. You can re-write this as an explicit join, but it is a little verbose:
SELECT *
FROM [x.z] as z
LEFT OUTER JOIN EACH [x.y] as y ON y.appln_id = z.application
WHERE y.not_appln_id is NULL
A working solution for the problem was achieved by first generating a table by matching only the ipc_class_symbol from the first table to the ipc column of the second, while also including the not_ipc and not_appln_id columns from the second. In addition, a list of all ipc class labels assigned to each appln_id was added using the GROUP_CONCAT method.
Finally, with help from Pentium10, the resulting table was filtered based on the exception rules, as also discussed in this question.
In the final query, the GROUP BY and JOIN arguments needed EACH modifiers to allow the large tables to be processed:
SELECT application as appln_id, ipc as ipc_class, ipc_share, sfields as ifris_class FROM (
SELECT * FROM (
SELECT ta.application as application, ta.ipc as ipc, ipc_count/ipc_tot as ipc_share, t3.sfields as sfields, t3.ipc as yes_ipc, t3.not_ipc as not_ipc, t3.not_appln_id as exclude, t4.classes as other_classes FROM (
SELECT t1.appln_id as application, t1.ipc_class_symbol as ipc, count(t2.appln_id) as ipc_count, sum(ipc_count) over (PARTITION BY application) as ipc_tot
FROM thesis.tls209_appln_ipc t1
FULL OUTER JOIN EACH
(SELECT appln_id, FROM thesis.tls209_appln_ipc GROUP EACH BY appln_id ) t2
ON t1.appln_id = t2.appln_id
GROUP EACH BY application, ipc
) AS ta
LEFT JOIN EACH (
SELECT appln_id, GROUP_CONCAT(ipc_class_symbol) as classes FROM [thesis.tls209_appln_ipc]
GROUP EACH BY appln_id) t4
ON ta.application = t4.appln_id
CROSS JOIN thesis.ifris_ipc_concordance t3
WHERE ta.ipc CONTAINS t3.ipc
) as tx
WHERE (not ipc contains not_ipc or not_ipc is null)
AND (not other_classes contains exclude or exclude is null or other_classes is null)
)
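The exception rules can be checked against the small example from the question with a plain-Python sketch. It is not a replacement for the BigQuery query and would not scale to 300 million rows. The function name assign_sfields is made up, and one assumption differs from the query above: the not_appln_id rule is tested by exact membership in the application's classes rather than by substring match on a concatenated string.

```python
# Rows from the question's small example.
applications = [(1, "A1"), (1, "A2"), (1, "A3"), (1, "C3")]
concordance = [
    {"sfields": "X", "ipc": "A", "not_ipc": "A2", "not_appln_id": None},
    {"sfields": "Y", "ipc": "A", "not_ipc": None, "not_appln_id": "A3"},
]


def assign_sfields(applications, concordance):
    """Return (appln_id, ipc_class_symbol, sfields) for every rule that applies."""
    # All classes assigned to each application (the role GROUP_CONCAT plays above).
    classes_by_appln = {}
    for appln_id, symbol in applications:
        classes_by_appln.setdefault(appln_id, set()).add(symbol)

    result = []
    for appln_id, symbol in applications:
        for rule in concordance:
            if not symbol.startswith(rule["ipc"]):
                continue  # symbol is not LIKE ipc + '%'
            if rule["not_ipc"] is not None and symbol.startswith(rule["not_ipc"]):
                continue  # symbol is LIKE not_ipc + '%'
            if rule["not_appln_id"] in classes_by_appln[appln_id]:
                continue  # excluded class occurs in the same application
            result.append((appln_id, symbol, rule["sfields"]))
    return result


labels = assign_sfields(applications, concordance)
print(labels)
```

On this input X is assigned to A1 and A3, and Y is never assigned because A3 occurs in appln_id 1, which matches the expected outcome stated in the question.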

how to join two or more tables and result set having all distinct values

I have about 20 Excel files containing data. All the tables have the same columns (id, name, age, location, etc.). Each file has distinct data, but I don't know whether data in one file is repeated in another. So I want to combine all the files so that the result set contains only distinct values. Please help me out with this problem. I want the result set to be stored in an Access database.
I would recommend either linking the sheets in Access or importing the sheets as tables.
From there, use a SELECT DISTINCT over the tables/sheets to determine the keys required, and select only the records you need.
In SQL, you can use JOIN or NATURAL JOIN to join tables. I would look into NATURAL JOIN since you said all tables have the same columns.
After that you can use DISTINCT to get distinct values.
I'm not sure if this is what you're looking for though: your question asks about excel but you've tagged it with SQL.
If you can use all the tables in one query, you can use a union to get the distinct rows:
select id, name, age, location from Table1
union
select id, name, age, location from Table2
union
select id, name, age, location from Table3
union
...
You can insert the records directly from the result:
insert into ResultTable
select id, name, age, location from Table1
union
....
If you only can select from one table at a time, you can skip the insert of rows that are already in the table:
insert into ResultTable
select t.id, t.name, t.age, t.location from Table1 as t
left join ResultTable as r on r.id = t.id
where r.id is null
(Assuming that id is a unique field identifying the record.)
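Both approaches above (one UNION over all tables, or inserting one table at a time while skipping ids that already exist) can be tried out with Python's built-in sqlite3 module. SQLite stands in for Access here, and the sample rows are invented, so this is a sketch of the SQL logic rather than Access-specific code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id INTEGER, name TEXT, age INTEGER, location TEXT);
    CREATE TABLE Table2 (id INTEGER, name TEXT, age INTEGER, location TEXT);
    CREATE TABLE ResultTable (id INTEGER, name TEXT, age INTEGER, location TEXT);
    INSERT INTO Table1 VALUES (1, 'Ann', 30, 'Oslo'), (2, 'Bob', 41, 'Rome');
    INSERT INTO Table2 VALUES (2, 'Bob', 41, 'Rome'), (3, 'Cid', 25, 'Lima');
""")

# Approach 1: a single UNION over all tables drops duplicate rows.
union_rows = conn.execute("""
    SELECT id, name, age, location FROM Table1
    UNION
    SELECT id, name, age, location FROM Table2
    ORDER BY id
""").fetchall()

# Approach 2: one table at a time, skipping ids already in ResultTable.
for source in ("Table1", "Table2"):
    conn.execute(f"""
        INSERT INTO ResultTable
        SELECT t.id, t.name, t.age, t.location FROM {source} AS t
        LEFT JOIN ResultTable AS r ON r.id = t.id
        WHERE r.id IS NULL
    """)

result_rows = conn.execute("SELECT * FROM ResultTable ORDER BY id").fetchall()
print(union_rows)
print(result_rows)
```

The two approaches agree on this data because id identifies a record. If the same id carried different values in two files, UNION would keep both rows while the second approach would keep only the first one inserted.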
It seems the unique set of data you want is this:
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
...but that you then want to arbitrarily apply a sequence of integers as id (rather than using the id values from the Excel tables).
Because Access Database Engine does not support common table expressions and Excel does not support VIEWs, you will have to repeat that UNION query as derived tables (hopefully the optimizer will recognize the repeat?) e.g. using a correlated subquery to get the row number:
SELECT (
SELECT COUNT(*) + 1
FROM (
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T1
UNION
SELECT T1.name, T1.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T1
) AS DT1
WHERE DT1.name < DT2.name
) AS id,
DT2.name, DT2.loc
FROM (
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db1.xls;
].[Sheet1$] AS T2
UNION
SELECT T2.name, T2.loc
FROM [Excel 8.0;HDR=YES;IMEX=1;DATABASE=C:\db2.xls;
].[Sheet1$] AS T2
) AS DT2;
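The row-numbering trick in that query (count the rows whose name sorts earlier, then add one) can be tried in SQLite through Python's sqlite3 module. The tables Sheet1 and Sheet2 below are ordinary tables with invented rows standing in for the two Excel sheets, so the connection-string syntax is not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sheet1 (name TEXT, loc TEXT);
    CREATE TABLE Sheet2 (name TEXT, loc TEXT);
    INSERT INTO Sheet1 VALUES ('Ann', 'Oslo'), ('Bob', 'Rome');
    INSERT INTO Sheet2 VALUES ('Bob', 'Rome'), ('Cid', 'Lima');
""")

# The UNION is repeated as two derived tables, as in the answer; the
# correlated subquery counts rows with a smaller name to produce the id.
numbered = conn.execute("""
    SELECT (
        SELECT COUNT(*) + 1
        FROM (SELECT name, loc FROM Sheet1
              UNION
              SELECT name, loc FROM Sheet2) AS DT1
        WHERE DT1.name < DT2.name
    ) AS id, DT2.name, DT2.loc
    FROM (SELECT name, loc FROM Sheet1
          UNION
          SELECT name, loc FROM Sheet2) AS DT2
    ORDER BY id
""").fetchall()

print(numbered)
```

The generated ids are unique only when the names are unique. Two rows with the same name but different loc values would receive the same id, so a tie-breaker on loc would be needed in that case.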
Note: "I want the result set to be stored in an Access database"
Then maybe you should migrate the Excel data into a staging table in your Access database and do the data scrubbing from there. At least you could put that derived table into a VIEW :)
A join combines two tables by matching the values in corresponding columns. As a result, you get a merged table that consists of the first table plus the matched rows copied from the second table. You can use the DIGBD add-in for Excel.
