Optimizing Theta Joins in Spark SQL

I have just 2 tables, and I need to get the records from the first table (big table, 10 M rows) whose transaction date is less than or equal to the effective date present in the second table (small table with 1 row). This result set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt |
30/07/2020 |
As per this logic, I should get back the first 3 rows and discard the last record, since its date is greater than the reference date (present in the REF table).
So I have used a non-equi (Cartesian) join between these tables:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this SQL is taking forever to complete due to the cross join with the transact table, even with broadcast hints.
So is there a smarter way to implement the same logic that is more efficient than this? In other words, is it possible to optimize the theta join in this query?
Thanks in advance.
So I wrote something like this:

Based on https://databricks.com/session/optimizing-apache-spark-sql-joins, can you try bucketing on tran_dt (bucketed on year/month only) and writing 2 queries to do the same work?
First query: tran_dt (year/month) < eff_dt (year/month). This could help you pick up whole buckets earlier than 2020/07, rather than checking the tran_dt of each and every record.
Second query: tran_dt (year/month) = eff_dt (year/month) and tran_dt (day) <= eff_dt (day).
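The two-query split suggested above can be checked on the sample data. The sketch below uses Python's built-in sqlite3 module rather than Spark, and it assumes the dates are stored as ISO yyyy-MM-dd strings (the question shows dd/MM/yyyy) so that string comparison matches date order. It only shows that the union of the two queries returns the same rows as the original theta join; it says nothing about Spark performance or bucketing itself.

```python
import sqlite3

# In-memory stand-in for the two tables; ISO date strings are an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transact (tran_id INTEGER, cust_id TEXT, tran_amt REAL, tran_dt TEXT);
CREATE TABLE ref (eff_dt TEXT);
INSERT INTO transact VALUES
  (1234, 'XYZ', 12.55, '2020-01-10'),
  (5678, 'MNP', 25.99, '2020-02-25'),
  (5561, 'XYZ', 32.45, '2020-04-30'),
  (9812, 'STR', 10.32, '2020-08-15');
INSERT INTO ref VALUES ('2020-07-30');
""")

# Original formulation: non-equi join against the one-row reference table.
theta_sql = """
SELECT a.tran_id FROM transact a
JOIN ref b ON a.tran_dt <= b.eff_dt
"""

# Split formulation: whole months before the reference month, plus the
# reference month itself filtered by the full date.
split_sql = """
SELECT tran_id FROM transact
WHERE substr(tran_dt, 1, 7) < (SELECT substr(eff_dt, 1, 7) FROM ref)
UNION ALL
SELECT tran_id FROM transact
WHERE substr(tran_dt, 1, 7) = (SELECT substr(eff_dt, 1, 7) FROM ref)
  AND tran_dt <= (SELECT eff_dt FROM ref)
"""

theta_ids = sorted(row[0] for row in conn.execute(theta_sql))
split_ids = sorted(row[0] for row in conn.execute(split_sql))
print(theta_ids, split_ids)
```

Because the reference table has a single row, both forms reduce to comparing each transaction date against one scalar value.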

Related

BigQuery - Delete rows from table partitioned by date

I have a dataset.table partitioned by date (100 partitions) like this:
table_name_(100), which means: table_name_20200101, table_name_20200102, table_name_20200103, ...
Example of table_name_20200101:
| id | col_1 | col_2 | col_3 |
-----------------------------------------------------------------------------
| xxx | 2 | 6 | 10 |
| yyy | 1 | 60 | 29 |
| zzz | 12 | 61 | 78 |
| aaa | 18 | 56 | 80 |
I would like to delete the row with id = yyy in all the (partitioned) tables:
DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'
I got this error :
Illegal operation (write) on meta-table
project_id:dataset_id.table_name_*
Is there a way to delete the rows with id 'yyy' in all the (partitioned) tables?
Thank you
Okay, a few things to call out here to ensure we're using consistent terminology.
You're talking about sharded tables, not partitioned. In a partitioned table, the data within the table is organized based on the partitioning specification. Here, you just have a series of tables named using a common prefix and a suffix based on date.
The use of the table_prefix* syntax is called a wildcard table, and DML is explicitly not allowed via wildcard tables: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
The table_name_(100) is an aspect of how the BigQuery UI collapses series of like-named tables to save space in the navigation panes. It's not how the service itself references tables at all.
The way you can accomplish this is to leverage other aspects of BigQuery: The INFORMATION_SCHEMA tables and scripting functionality.
Information about what tables are in a dataset is available via the TABLES view: https://cloud.google.com/bigquery/docs/information-schema-tables
Information about scripting can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, here's an example that combines these concepts:
DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;
# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT
    ARRAY_AGG(t)
  FROM (
    SELECT
      TABLE_NAME as t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);
# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables)
    THEN LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  #
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table. For safety sake, you may want to start with
  # something like a SELECT query to validate your assumptions and project the
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;
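As a dry run before mutating anything, the statement-building step of the script above can be sketched outside BigQuery. The Python below is only an illustration: build_delete_statements is a hypothetical helper, the table names are passed in as a plain list instead of being read from INFORMATION_SCHEMA, and nothing is sent to BigQuery.

```python
def build_delete_statements(project, dataset, table_names, prefix, row_id):
    """Return one DELETE statement per table whose name starts with prefix.

    Hypothetical helper for inspection only. row_id is interpolated directly,
    so it must be a trusted value; real code should prefer query parameters.
    """
    statements = []
    for name in sorted(table_names):
        if not name.startswith(prefix):
            continue  # skip tables that are not part of the sharded series
        statements.append(
            f"DELETE FROM `{project}.{dataset}.{name}` WHERE id = '{row_id}'"
        )
    return statements


# Example shard names taken from the question, plus one unrelated table.
tables = ["table_name_20200102", "other_table", "table_name_20200101"]
statements = build_delete_statements(
    "my-project-id", "my_dataset", tables, "table_name_", "yyy"
)
for statement in statements:
    print(statement)
```

Printing the statements first plays the same role as the LIMIT clause and the suggested SELECT in the script: it lets you confirm which tables would be touched.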

Salting Technique to tackle Skew in Spark SQL

I am trying to understand Salting techniques to tackle Skew in Spark SQL. I have done some reading online and I have come up with a very rudimentary implementation of the same in Spark SQL API.
Let's assume that table1 is Skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more than other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
cid, item, concat(cid, '-', floor(rand() * 20)) as salted_key
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.vehicle
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
Is there any rule of thumb for choosing the optimal number of splits? E.g. if table1 has 10 million records, how many bins/buckets should I use? (In this simple test example I have used 20.)
As shown above, when creating table2_salt I am hardcoding the salts (0, 1, 2, 3, ... through 19). Is there a better way to implement the same functionality without the hardcoding and the clutter? (What if I want to use 100 splits?)
Since we are replicating the second table (table2) N times, doesn't that mean it will degrade the join performance?
Note: I need to use Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.
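The salting idea described above can be simulated in plain Python to see that it preserves the join result while the salt list is generated rather than hardcoded. This is a sketch, not Spark code: NUM_SALTS, the fixed random seed and the in-memory lists are assumptions for illustration. In Spark SQL 2.4 the built-in sequence function may serve the same purpose as the range call here (for example explode(sequence(0, 19))), though that is not verified in this sketch.

```python
import random

NUM_SALTS = 20  # assumed number of splits, as in the question

table1 = [(1, "light"), (1, "cookie"), (1, "ketchup"), (1, "bottle"),
          (2, "dish"), (3, "cup")]
table2 = [(1, "taxi"), (1, "truck"), (2, "cycle"), (3, "plane")]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Skewed side: append one random salt in [0, NUM_SALTS) to every key.
table1_salted = [(f"{cid}-{rng.randrange(NUM_SALTS)}", cid, item)
                 for cid, item in table1]

# Other side: replicate every row once per salt, generated with range().
table2_salted = [(f"{cid}-{salt}", vehicle)
                 for cid, vehicle in table2
                 for salt in range(NUM_SALTS)]

# Hash join on the salted key.
index = {}
for key, vehicle in table2_salted:
    index.setdefault(key, []).append(vehicle)
salted_join = sorted((cid, item, vehicle)
                     for key, cid, item in table1_salted
                     for vehicle in index.get(key, []))

# Reference result: plain equi-join on cid.
plain_join = sorted((c1, item, vehicle)
                    for c1, item in table1
                    for c2, vehicle in table2 if c1 == c2)

print(salted_join == plain_join, len(table2_salted))
```

The replicated side grows by a factor of NUM_SALTS, which is the cost the third question asks about; the sketch makes that growth visible but does not measure its effect on a real cluster.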

Usage of cqlsh is similar to MySQL, what's the difference?

cqlsh create table:
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
insert data
INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
select data
SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
This looks just the same as MySQL. Where are the column family and the column?
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
so table emp is a column family?
INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000); is a row which contains columns?
column here is something like emp_id=>1 or emp_name=>ram ??
In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
what does this mean?
I can have something like this?
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000 | asdfasd | asdfasdf
3 | Chennai | rahman | 9848022330 | 45000
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Where is super column, how to create it?
Column family is an old name; now it's just called a table.
Super column is also an old term. You now have the "Map" data type, for example, or user-defined data types for more complex structures.
About freely adding columns: in the old days, Cassandra worked with an unstructured data paradigm, so you didn't have to define columns before inserting them. That is no longer possible, since the Cassandra team moved to "structured" only (as many in the database industry came to the conclusion that unstructured data causes more problems than it is worth).
Anyway, Cassandra's data representation at the storage level is very different from MySQL's, and indeed it saves data only for the columns that aren't empty. It may look like the same row when you run a select from cqlsh, but it is stored and queried in a very different way.
The name column family is an old term for what's now simply called a table, such as "emp" in your example. Each table contains one or many columns, such as "emp_id", "emp_name".
Being able to freely add columns at any time means that you can always omit values for columns (they will be null) or add columns using the ALTER TABLE statement.
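The points above (only non-empty columns are stored, omitted values read back as null, and columns can be added later) can be pictured with a small Python model. This is a conceptual sketch only: the dictionaries, select_all and add_column are hypothetical names for illustration and do not reflect Cassandra's actual storage format or driver API.

```python
# Conceptual model only; this is not Cassandra's on-disk format.
COLUMNS = ["emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal"]

# Each row keeps only the columns that were actually written.
rows = {
    1: {"emp_name": "ram", "emp_city": "Hyderabad",
        "emp_phone": 9848022338, "emp_sal": 50000},
    2: {"emp_name": "robin", "emp_city": "Hyderabad"},  # phone, salary omitted
}


def select_all(rows, columns):
    """Render rows like SELECT * would, showing unset columns as None."""
    return [
        tuple(key if col == "emp_id" else cells.get(col) for col in columns)
        for key, cells in sorted(rows.items())
    ]


def add_column(columns, name):
    """Mimic ALTER TABLE ... ADD: stored rows are untouched, new column reads as None."""
    return columns + [name]


before = select_all(rows, COLUMNS)
after = select_all(rows, add_column(COLUMNS, "emp_dept"))
print(before)
print(after)
```

The query result looks like a fixed grid, as in cqlsh, even though each stored row holds a different set of cells.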

Recreating a non-straightforward Excel 'vlookup'

I'm looking for some thoughts on how you might recreate a 'vlookup' that I currently do in excel.
I have two tables: Data contains a list of datetime values; DateConverter contains a list of calendar dates and their associated "network dates." Imagine a business where not every day is a workday: if I want to calculate differences between dates, I'm most interested in the number of workdays that elapsed between my two dates.
Here is what the data might look like:
Data Table DateConverter Table
================= ===================
| Datetime | | Calendar date | Network date |
| ------------- | | ------------- | ------------ |
| 6-1-15 8:00a | | 6-1-15 | 1000 |
| 6-2-15 1:00p | | 6-2-15 | 1001 |
| 6-3-15 7:00a | | 6-3-15 | 1002 |
| 6-10-15 3:00p | | 6-4-15 | 1003 |
| 6-15-15 1:00p | | 6-5-15 | 1004 |
| 6-12-15 2:00a | | 6-8-15 | 1005 | // Skips the weekend
| ... | | ... | ... |
In excel, I can easily map in the network date for each date in the Datetime field with a variant of vlookup:
// Assume that Datetime values are in Column A, Calendar date values in
// Column C, Network date values in Column D - this formula fills Column B
// Headers are in row 1 - first values are in row 2
B2=OFFSET($D$1,COUNTIFS($C:$C,"<"&A2),)
The formula counts the dates that are less than the lookup value (using countifs because the values in the search array are dates, and the search value is a datetime) and returns the associated network date.
Is there a way to do this in Tableau? Will it require a calculated field or can I do this with some kind of join?
Thanks in advance for the help! Let me know if there is anything I can clarify. Thanks!
If the tables are on the same data server, you have the option to use joins, which is usually the most efficient way to combine information from different tables. If the tables are on different servers or platforms, then you can't use a single query to join them.
In either case, you can use Tableau data blending, which is sort of like a client-side join of aggregated results from multiple queries. It's a pretty useful technique, but a little more complex and restricted, and also usually less efficient than a server-side join.
So if you have the option to have both tables on the same server, start with that. It will be simpler and likely faster.
Note that if you are going to use a date as a join key, you probably want to define it as a date and not a datetime.
@alex-blakemore's response would normally be adequate, but if you can change the schema, you could simply add the network date to the Data table. The hourly granularity should not cause excessive growth, and you don't need to navigate the join.
Then, instead of counting rows and requiring a sorted table, simply subtract one network date from the other and add 1.
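For reference, the lookup that the Excel formula in the question performs can be sketched in Python. This is not Tableau code: it assumes the calendar dates are sorted ascending (as the counting approach requires), uses only the six sample rows from the question, and network_date is a hypothetical helper name.

```python
from bisect import bisect_left
from datetime import datetime

# Sample DateConverter rows from the question (sorted by calendar date).
calendar_dates = [datetime(2015, 6, 1), datetime(2015, 6, 2),
                  datetime(2015, 6, 3), datetime(2015, 6, 4),
                  datetime(2015, 6, 5), datetime(2015, 6, 8)]
network_dates = [1000, 1001, 1002, 1003, 1004, 1005]


def network_date(timestamp):
    """Return the network date of the last calendar date strictly before timestamp.

    bisect_left gives the number of calendar dates < timestamp, which is the
    same count that COUNTIFS($C:$C, "<" & A2) produces in the Excel formula.
    """
    count = bisect_left(calendar_dates, timestamp)
    if count == 0:
        raise ValueError("timestamp is not after any calendar date")
    return network_dates[count - 1]


monday_morning = network_date(datetime(2015, 6, 1, 8, 0))
saturday = network_date(datetime(2015, 6, 6, 10, 0))  # weekend, maps to Friday
next_monday = network_date(datetime(2015, 6, 8, 9, 0))
print(monday_morning, saturday, next_monday, next_monday - monday_morning)
```

Subtracting two network dates gives the elapsed workdays directly, which is the quantity the question is after.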

postgres join list with $ delimiter

From these tables:
select group, ids
from some.groups_and_ids;
Result:
group | group_ids
---+----
winners | 1$4
losers | 4
others | 2$3$4
and:
select id,name from some.ids_and_names;
id | name
---+----
1 | bob
2 | robert
3 | dingus
4 | norbert
How would you go about returning something like:
winners | bob, norbert
losers | norbert
others | robert, dingus, norbert
with normalized (group_name, id) as (
select group_name, unnest(string_to_array(group_ids,'$')::int[])
from groups_and_ids
)
select n.group_name, string_agg(p.name,',' order by p.name)
from normalized n
join ids_and_names p on p.id = n.id
group by n.group_name;
The first part (the common table expression) normalizes your broken table design by creating a proper view on the groups_and_ids table. The actual query then joins the ids_and_names table to the normalized version of your groups and then aggregates the names again.
Note that I renamed group to group_name because group is a reserved keyword.
SQLFiddle: http://sqlfiddle.com/#!15/2205b/2
Is it possible to redesign your database? Putting all the group_ids into one column makes life hard. If your table was e.g.
group | group_id
winners | 1
winners | 4
losers | 4
etc. this would be trivially easy. As it is, the below query would do it, although I hesitated to post it, since it encourages bad database design (IMHO)!
p.s. I took the liberty of renaming some columns, because they are reserved words. You can escape them, but why make life difficult for yourself?
select group_name,array_to_string(array_agg(username),', ') -- array aggregation and make it into a string
from
(
select group_name,theids,username
from ids_and_names
inner join
(select group_name,unnest(string_to_array(group_ids,'$')) as theids -- unnest a string_to_array to get rows
from groups_and_ids) i
on i.theids = cast(id as text)) a
group by group_name
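The split, join and re-aggregate steps that both queries perform can be traced on the sample data with a short Python sketch. It is only an illustration of the logic, not a replacement for the SQL: names_per_group is a hypothetical helper, and it keeps the ids in the order they appear in the delimited string, which matches the output the question asks for.

```python
groups_and_ids = [("winners", "1$4"), ("losers", "4"), ("others", "2$3$4")]
ids_and_names = {1: "bob", 2: "robert", 3: "dingus", 4: "norbert"}


def names_per_group(groups, names, delimiter="$"):
    """Split each id list, look up the names, and join them back into one string."""
    result = {}
    for group_name, group_ids in groups:
        # Equivalent of unnest(string_to_array(group_ids, '$')::int[])
        ids = [int(part) for part in group_ids.split(delimiter)]
        # Inner-join behaviour: ids without a matching name are dropped.
        result[group_name] = ", ".join(names[i] for i in ids if i in names)
    return result


grouped = names_per_group(groups_and_ids, ids_and_names)
for group_name, joined_names in grouped.items():
    print(group_name, "|", joined_names)
```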

Resources