How to preserve partition after joining two tables in Athena?

I have two Athena tables, 1 and 2. Table 1 is partitioned, table 2 is not. When I create table 3 from the result of joining 1 and 2 on a common field, the partitioning from table 1 isn't propagated.
I know it's possible to do CTAS queries with partitions, but that requires the partition to be an existing column.
Is there a way to keep table 1's partitioning when creating table 3, something like this:
CREATE TABLE table_3
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['existing_partition_in_table_1']
) AS
SELECT table_1.field
FROM table_1
JOIN table_2
ON table_1.field = table_2.field

Figured it out five minutes later... I just need to select the partition column from table 1 as well; the CTAS statement can then pick it up. (Athena requires partition columns to come last in the SELECT.)
CREATE TABLE table_3
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['partition_name']
) AS
SELECT table_1.field, table_1.partition_name
FROM table_1
JOIN table_2
ON table_1.field = table_2.field
*facepalm
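As a quick sanity check (a sketch, using the table_3 created above), SHOW PARTITIONS lists what the CTAS actually produced:
-- list the partitions Athena created for the new table
SHOW PARTITIONS table_3;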

Related

How to avoid key column name duplication in join?

I'm trying to join two tables in Spark SQL. Each table has 50+ columns, and both have a column id as the key.
spark.sql("select * from tbl1 join tbl2 on tbl1.id = tbl2.id")
The joined table has a duplicated id column.
We can of course specify which id column to keep, like below:
spark.sql("select tbl1.id, ..... from tbl1 join tbl2 on tbl1.id = tbl2.id")
But since we have so many columns in both tables, I do not want to type out all the other column names in the query above (other than the id column, there are no duplicated column names).
What should I do? Thanks.
If id is the only column name in common, you can take advantage of the USING clause:
spark.sql("select * from tbl1 join tbl2 using (id) ")
The using clause matches columns that have the same name in both tables. When using select *, the column appears only once.
Assuming you want to preserve the "duplicates", you can use an internal row id or an equivalent. This helped me in the past when I had to delete exactly one of two identical rows.
select *, ctid from table;
In PostgreSQL this also outputs the internal tuple id, so rows that were previously identical become distinguishable. I don't know about spark.sql, but I assume you can access a similar attribute there.
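For what it's worth, Spark SQL's built-in monotonically_increasing_id() can play a similar role; a minimal sketch, assuming the tbl1 from above:
-- tag each row with a unique (though not consecutive) id,
-- so otherwise-identical rows become distinguishable
select monotonically_increasing_id() as row_id, * from tbl1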
val joined = spark
  .sql("select * from tbl1")
  .join(
    spark.sql("select * from tbl2"),
    Seq("id"),
    "inner" // optional
  )
joined should have only one id column. Tested with Spark 2.4.8.

How to do compare/subtract records

Table A has 20 records and table B has 19. How do I find the one record that is missing from table B? How do I compare/subtract the records of these two tables to find that one record? I'm running the query in Apache Superset.
The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, then you would just add logic to the exists clause, e.g. for three columns:
WHERE NOT EXISTS (SELECT 1 FROM TableB b
                  WHERE b.col1 = a.col1
                    AND b.col2 = a.col2
                    AND b.col3 = a.col3)
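If the database behind Superset supports it, a set difference expresses the same comparison across all columns at once (a sketch; EXCEPT is standard SQL but not available in every engine):
-- rows present in TableA but missing from TableB, comparing every column
SELECT * FROM TableA
EXCEPT
SELECT * FROM TableB;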

Insert selective columns to hive

I want to insert selected columns into Hive, and I am unable to do so. This is what I was trying via Spark:
val df2 = spark.sql("SELECT Device_Version,date, SUM(size) as size FROM table1 WHERE date='2019-06-13' GROUP BY date, Device_Version")
df2.createOrReplaceTempView("tempTable")
spark.sql("Insert into table2 PARTITION (date,ID) (Device_Version) SELECT Device_Version, date, '1' AS ID FROM tempTable")
My aim is to insert only selected fields into table2. Table2 has many other columns, which I want padded with null. I can do the padding as long as I can specify the order; I do not want the order to be taken by default.
Something like ...
spark.sql("Insert into table2 PARTITION (date,cuboid_id) (Device_Version,OS) SELECT Device_Version, null as os, date, '10001' AS CUBOID_ID FROM tempTable")
Is there any way to do this? Any options are welcome.
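For what it's worth, a common workaround (a sketch; the OS column and the column order are assumptions about table2's schema) is to pad every unselected column with a typed NULL in the SELECT, listing the columns in table2's declared order with the partition columns last, then run it through spark.sql as above:
INSERT INTO table2 PARTITION (date, ID)
SELECT Device_Version,
       CAST(NULL AS STRING) AS OS,  -- pad each unselected column with a typed null
       date,                        -- dynamic partition columns must come last
       '1' AS ID
FROM tempTable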

How to optimize a join?

I have a query to join the tables. How do I optimize to run it faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
The distinct_viewed_2610 table has 1.1 million records, and I am trying to get matching ids for it from table tets_2610, which has 200,000 rows, by splitting the second column.
For 100,000 records it takes 8.5 hours to complete the job on two machines:
one with 16 GB RAM and 16 cores,
the second with 8 GB RAM and 8 cores.
Is there a way to optimize the query?
Right now you are doing a Cartesian join, which gives you 1.1M * 200K = 220 billion rows; only after the Cartesian join are they filtered by where FIND_IN_SET(a.value, b.other) != 0.
Analyze your data.
If the 'other' string contains 10 elements on average, then exploding it will give you 2.2M rows for table b. And if, say, only 1/10 of those rows join, you will end up with 2.2M/10 = 220K rows because of the INNER JOIN.
If these assumptions are correct, then exploding the array and joining will perform better than Cartesian join + filter:
select distinct a.value as viewedid, b.otherids
from bm.distinct_viewed_2610 a
inner join (select e.otherid, b.other as otherids
            from bm.tets_2610 b
            lateral view explode(split(b.other, ',')) e as otherid
           ) b on a.value = b.otherid
And you do not need this:
and a.value in (select value from bm.distinct_viewed_2610)
Sorry, I cannot test the query; please try it yourself.
If you are using ORC format, change to Parquet; based on your data I would say choose range partitioning.
Choose proper parallelization to execute fast.
I have answered on the following link; it may help you:
Spark doing exchange of partitions already correctly distributed
Also please read it
http://dev.sortable.com/spark-repartition/

Cassandra slow SELECT MAX(x) query

I have a dev machine with Cassandra 3.9 and 2 tables, one with ~400,000 records, the other with about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table, the query takes a couple of seconds, and on the second table, it times out.
My experience is with relational databases where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In Cassandra, aggregation functions such as MIN, MAX, COUNT, SUM, or AVG on a table without specifying a partition key are bad practice. Instead, you can have another table that stores the max value of the x field for both tables.
However, you have to add some client-side logic to maintain this max value in the other table when you run INSERT or UPDATE statements.
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
With this structure you can get the max value for a table:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
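The client-side maintenance could look something like this (a sketch, assuming the schema above; the IF clause uses a lightweight transaction so that racing writers can't overwrite a larger max with a smaller one):
-- after writing a new x (say 42) to t1, update the running max;
-- the write only succeeds if 42 beats the stored value
UPDATE agg_table
SET max_value = 42
WHERE table_name = 't1'
IF max_value < 42;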
Hope this can help you.
