Salting Technique to tackle Skew in Spark SQL

Salting Technique to tackle Skew in Spark SQL - apache-spark

I am trying to understand Salting techniques to tackle Skew in Spark SQL. I have done some reading online and I have come up with a very rudimentary implementation of the same in Spark SQL API.
Let's assume that table1 is Skewed on cid=1:
Table 1:
cid | item
---------
1 | light
1 | cookie
1 | ketchup
1 | bottle
2 | dish
3 | cup
As shown above, cid=1 occurs more than other keys.
Table 2:
cid | vehicle
---------
1 | taxi
1 | truck
2 | cycle
3 | plane
Now my code looks like the following:
create temporary view table1_salt as
select
cid, item, concat(cid, '-', floor(rand() * 19)) as salted_key
from table1;
create temporary view table2_salt as
select
cid, vehicle, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key
from table2;
Final Query:
select a.cid, a.item, b.name
from table1_salt a
inner join table2_salt b
on a.salted_key = concat(b.cid, '-', b.salted_key);
In the above example, I have used 20 salts/splits.
Questions:
Is there any rule of thumb to choose optimal number for the splits to
be used ? For e.g. if table1 has 10 Million records, how many bins/buckets should I use ? (In this simple test example I have used 20).
As shown above, when I am creating Table2_salt, I am hardcoding the
the salts like (0, 1, 2, 3.... thru 19). Is there a better
way to implement the same functionality, but without the
hardcoding and the clutter ? (What if I want to use 100 splits!)
Since we are replicating the second table (table2) N number of times, doesn't it mean that it will degrade the Join performance ?
Note: I need to use Spark 2.4 SQL API only.
Also, kindly let me know if there are any advanced examples available on the net. Any help is appreciated.

Related

How to join two tables, apply where on both tables , apply pagination with bookshelf node js

I have two tables mentioned below:
Reports
Id |status
1 |Active
Reports_details
Id| Cntry| State | City
1 | IN | UP | Delhi
1 | US | Texas | Salt lake
Now my requirement is
Select distinct r.Id from Reports r left join Reports_details rd on r.Id=rd.Id where r.status=‘Active’ and contains(city,’”Del*”’)
Note: using contains for full text search
Problem: How to add where clause on both tables Bookshelf Model simultaneously
and how to fetch above query data with pagination
Tried created 2 respective Models with belongs on and hasMany but issue comes when applying where on either Model, it’s not accepting where clause from both table-error:Invalid column name
Appreciate your suggestion on the work around. Thank You

Optimizing Theta Joins in Spark SQL

I have just 2 tables wherein I need to get the records from the first table (big table 10 M rows) whose transaction date is lesser than or equal to the effective date present in the second table (small table with 1 row), and this result-set will then be consumed by downstream queries.
Table Transact:
tran_id | cust_id | tran_amt | tran_dt
1234 | XYZ | 12.55 | 10/01/2020
5678 | MNP | 25.99 | 25/02/2020
5561 | XYZ | 32.45 | 30/04/2020
9812 | STR | 10.32 | 15/08/2020
Table REF:
eff_dt |
30/07/2020 |
Hence as per logic I should get back the first 3 rows and discard the last record since it is greater than the reference date (present in the REF table)
Hence, I have used a non-equi Cartesian Join between these tables as:
select
/*+ MAPJOIN(b) */
a.tran_id,
a.cust_id,
a.tran_amt,
a.tran_dt
from transact a
inner join ref b
on a.tran_dt <= b.eff_dt
However, this sql is taking forever to complete due to the cross Join with the transact table even using Broadcast hints.
So is there any smarter way to implement the same logic which will be more efficient than this ? In other words, is it possible to optimize the Theta join in this query ?
Thanks in advance.
So I wrote something like this:

Referring from https://databricks.com/session/optimizing-apache-spark-sql-joins
Can you try Bucketing on trans_dt (Bucketed on Year/Month only). And write 2 queries to do the same work
First query, trans_dt(Year/Month) < eff_dt(Year/Month). So this could help you actively picking up buckets(rather than checking each and every record trans_dt) which is less than 2020/07.
second query, trans_dt(Year/Month) = eff_dt(Year/Month) and trans_dt(Day) <= eff_dt(Day)

Cross Join for calculation in Spark SQL

I have a temporary view with only 1 record/value and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue.
Is there a better approach to implement this requirement which is will perform better ? Will a broadcast hint be suitable in this scenario ? What is the recommended approach to tackle such scenarios ?
Reference table: (contains only 1 value)
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.ref_dt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
My question is - Is there a better approach to implement this requirement ?
Thanks

Simply use withColumn!
df.withColumn("new_col", lit("10-05-2020").cast("date"))

Inside view you are using constant value, You can simply put same value in below query without cross join.
select
a.custid,
a.birthdt,
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;
scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+
scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+

Hard to work out exactly your point, but if you cannot use Scala or pyspark and dataframes with .cache etc. then I think that instead of of using a temporary view, just create a single row table. My impression is you are using Spark %sql in a notebook on, say, Databricks.
This is my suspicion as it were.
That said a broadcastjoin hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints

Cassandra two dimensional data modelling

The Use case:
For a game I am collecting the results of each game match. It's always Team A against Team B. Each team consists of 5 players each picking a champion and the possible outcome of a match is for one team either Won / Lost or for both teams a draw.
I would like to figure out the best champion combinationsI want to create win/lose/draw statistics based on the chosen champion combination of each team. In total there are ~100 champions a player can chose from. So there are many different champion combinations possible.
More (bonus) features:
I would like to figure out how one combination performed against another specific combination (in short: what's the best combination to counter a very strong champion combination)
As balance changes are applied to the game it makes sense to have a possibility to select / filter stats by specific timeranges (for instance past 14 days only) - daily precision is fine for that
My problem:
I wonder what's the best way to collect the statistics based on the champion combination? How would the data modelling look like?
My idea:
Create a hash of all championId in a combination which would literally represent a championCombinationId which is a unique identifier for the champion combo a team uses.
Create a two dimensional table which allows tracking combination vs combination stats. Something like this:
Timeframes (daily dates) and the actual championIds for a combinationId are missing there.
I tried myself creating a model for the above requirements, but I am absolutely not sure about it. Nor do I know what keys I would need to specify.
CREATE TABLE team_combination_statistics (
combinationIdA text, // Team A
combinationIdB text, // Team B
championIdsA text, // An array of all champion IDs of combination A
championIdsB text, // An array of all champion IDs of combination B
trackingTimeFrame text, // A date?
wins int,
losses int,
draws int
);

This question is quite long so I'll talk about different topics before suggesting my approach, be ready for a long answer:
Data normalization
Two-dimensional tables with same value axes
Data normalization
Storing total ammount of data is useful but ordering by it isn't, as the order doesn't determine if a combination is good vs another, it determines the combination that most times have won/lost vs the opposite but the total ammount of games played also matters.
When ordering the results, you want to order by win-ratio, draw-ratio, loose-ratio of two of the previous as the third is a linear combination.
Two-dimensional tables with same value axes
The problem on two-dimensional tables where both dimensions represent the same data, in this case a group of 5 champs, is that either you make a triangular table or you have data doubled as you will have to store cominationA vs combinationB and combinationB vs combinationA, being combinationX a specific group of 5 champs.
There are two aproaches here, using triangular tables or doubling the data manually:
1. Triangular tables:
You create a table where either the top right half is empty or the bottom left hand is empty. You then handle in the app which hash is A and which is B, and you may need to swap their order, as there is no duplicate data. You could for example consider alphabetical order where A < B always. If you then request the data in the wrong order you would get no data. The other option would be making both A vs B and B vs A query and then joining the results (swapping the wins and looses obviously).
2. Doubling the data manually:
By making two inserts with reflected values (A, B, wins, draws, looses & B, A, looses, draws, wins) you would duplicate the data. This lets you query in any order at the cost of using two times the space and requiring double inserts.
Pros and cons:
The pros of one approach are the cons of the other.
Pros of triangular tables
Does not store duplicate data
Requires half the insert
Pros of doubling the data
The application doesn't care in which order you make the request
I would probably use the triangular tables approach as the application complexity increase is not that big to be relevant, but the scalability does matter.
Proposed schema
Use whatever keyspace you want, I choose so from stackoverflow. Modify the replication strategy or factor as needed.
CREATE KEYSPACE so WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Champion names table
The champion table will contain info about the different champions, for now it will only hold the name but you could store other things in the future.
CREATE TABLE so.champions (
c boolean,
id smallint,
name text,
PRIMARY KEY(c, id)
) WITH comment='Champion names';
A boolean is used as the partition key as we want to store all champs in a single partition for query performance and there will be a low ammount of records (~100) we will always be using c=True. A smallint was choosen for the id as 2^7 = 128 was to close to the actual number of champs and to leave room for future champs without using the negative numbers.
When querying the champs you could get them all by doing:
SELECT id, name FROM so.champions WHERE c=True;
or request a specific one by:
SELECT name FROM so.champions WHERE c=True and id=XX;
Historic match results table
This table will store the results of the matches without agregating:
CREATE TABLE so.matches (
dt date,
ts time,
id XXXXXXXX,
teams list<frozen<set<smallint>>>,
winA boolean,
winB boolean,
PRIMARY KEY(dt, ts, id)
) WITH comment='Match results';
For the partition of an historic data table, and as you mentioned daily precission, date seems to be a nice partition key. A time column is used as the first clustering key for ordering reasons and to complete the timestamp, doesn't matter if these timestamp belong to the ending or finishing instant, choose one and stick with it. An additional identifier is required in the clustering key as 2 games may end in the same instant (time has nanosecond precission which would basically mean that the data lost to overlap would be quite insignificant but your data source will probably not have this precission, thus making this last key column necesary). You can use whatever type you want for this column, probably you will already have some king of identifier with the data that you can use here. You could also go for a random number, an incremental int managed by the application, or even the name of the first players as you can be sure the same player will not start/finish two games at the same second.
The teams column is the most important one: it stores the ids of the champs that were played in the game. A sequence of two elements is used, one for each team. The inner (frozen) set is for the champs id in each team, for example: {1,3,5,7,9}. I've tried a couple different options: set< frozen<set<smallint>> >, tuple< set<smallint>> , set<smallint> > and list< frozen<set<smallint>> >. The first options doesn't store the order of the teams, so we would have no way to know who win the game. The second one doesn't accept using an index on this column and doing partial searchs through CONTAINS so I've opted for the third that does keep the order and allows partial searchs.
The other two values are two booleans representing who won the game. You could have additional columns such a draw boolean one but this one is not necesary or duration time if you want to store the length of the game (I'm not using Cassandra's duration type on purpouse as it is only worth when it takes months or at least days), end timestamp/start timestamp if you want to store the one that you are not using in the partition and clustering key, etc.
Partial searchs
It may be useful to create an index on teams so that you are allowed to query on this column:
CREATE INDEX matchesByTeams ON so.matches( teams );
Then we can execute the following SELECT statenments:
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9};
SELECT * FROM so.matches WHERE teams CONTAINS {1,3,5,7,9} AND dt=toDate(now());
The first one would select the matches in which any of the teams selected that composition and the second one will further filter it to today's matches.
Stats cache table
With these two tables you can hold all the info, and then request the data you need to calculate the stats involved. Once you calculate some data, you could store this info back in Cassandra as a "cache" in an additional table so that when a user requests some stats to be shown, you first check if they were already calculated and if they weren't calculate. This table would need to have a column for each parameter that the user can enter, for example: champion composition, starting date, final date, enemy team; and additional columns for the stats themselves.
CREATE TABLE so.stats (
team frozen<set<smallint>>,
s_ts timestamp,
e_ts timestamp,
enemy frozen<set<smallint>>,
win_ratio float,
loose_ratio float,
wins int,
draws int,
looses int,
PRIMARY KEY(team, s_ts, e_ts, enemy)
) WITH comment="Already calculated queries";
Ordered by win/loose ratios:
To get the results order by ratios instead of enemy team you can use materialized views.
CREATE MATERIALIZED VIEW so.statsByWinRatio AS
SELECT * FROM so.stats
WHERE team IS NOT NULL AND s_ts IS NOT NULL AND e_ts IS NOT NULL AND win_ratio IS NOT NULL AND enemy IS NOT NULL
PRIMARY KEY(team, s_ts, e_ts, win_ratio, enemy)
WITH comment='Allow ordering by win ratio';
NOTE:
While I was answering I realized that introducing the concept of "patch" inside the DB so that the user is not allowed to determine dates but patches could be a better solution. If you are interested comment and I'll edit the answer to include the patch concept. It would mean modifying both the so.historic and so.stats tables a bit, but quite minor changes.

You can create a statistics table which holds game stats for a champion in a given day.
CREATE TABLE champion_stats_by_day (
champion_ids FROZEN<SET<INT>>,
competing_champion_ids FROZEN<SET<INT>>,
competition_day DATE,
win_ratio DECIMAL,
loss_ratio DECIMAL,
draw_ratio DECIMAL,
wins INT,
draws INT,
losses INT,
matches INT,
PRIMARY KEY(champion_ids, competition_day, competing_champion_ids)
) WITH CLUSTERING ORDER BY(competition_day DESC, competing_champion_ids ASC);
You can ask stats for a champion starting from a certain date, but you have to do the sorting / aggregation in the clients:
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competition_day > '2017-10-17';
champion_ids | competition_day | competing_champion_ids | draw_ratio | draws | loss_ratio | losses | matches | win_ratio | wins
--------------+-----------------+------------------------+------------+-------+------------+--------+---------+-----------+------
{1, 2, 3, 4} | 2017-11-01 | {2, 9, 21, 33} | 0.04 | 4 | 0.57 | 48 | 84 | 0.38 | 32
{1, 2, 3, 4} | 2017-11-01 | {5, 6, 22, 32} | 0.008 | 2 | 0.55 | 128 | 229 | 0.43 | 99
{1, 2, 3, 4} | 2017-11-01 | {12, 21, 33, 55} | 0.04 | 4 | 0.57 | 48 | 84 | 0.38 | 32
{1, 2, 3, 4} | 2017-10-29 | {3, 8, 21, 42} | 0 | 0 | 0.992 | 128 | 129 | 0.007 | 1
{1, 2, 3, 4} | 2017-10-28 | {2, 9, 21, 33} | 0.23 | 40 | 0.04 | 8 | 169 | 0.71 | 121
{1, 2, 3, 4} | 2017-10-22 | {7, 12, 23, 44} | 0.57 | 64 | 0.02 | 3 | 112 | 0.4 | 45
Update & insert works as following. You first select the existing statistic for that date and champion ID and then do an update. In case, when row is not in the table it's not going to be a problem as Cassandra performs and UPSERT in this case.:
SELECT * FROM champion_stats_by_day WHERE champion_ids = {1,2,3,4} AND competing_champion_ids = {21,2,9,33} AND competition_day = '2017-11-01';
UPDATE champion_stats_by_day
SET win_ratio = 0.38, draw_ratio = 0.04, loss_ratio = 0.57, wins = 32, draws = 4, losses = 48, matches = 84
WHERE champion_ids = {1,2,3,4}
AND competing_champion_ids = {21,2,9,33}
AND competition_day = '2017-11-01';
I also added the sample CQL commands here.
Let me know what you think.

Duplicate rows/columns for the same primary key in Cassandra

I have a table/columnfamily in Cassandra 3.7 with sensordata.
CREATE TABLE test.sensor_data (
house_id int,
sensor_id int,
time_bucket int,
sensor_time timestamp,
sensor_reading map<int, float>,
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
)
Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.
cqlsh:test> select * from sensor_data;
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
1 | 2 | 3 | 2016-01-02 03:04:05.000000+0000 | {1: 101}
I think part of the problem is that this data has both been written "live" using java and Datastax java driver, and it has been loaded together with historic data from another source using sstableloader.
Regardless, this shouldn't be possible.
I have no way of connecting with the legacy cassandra-cli to this cluster, perhaps that would have told me something that I can't see using cqlsh.
So, the questions are:
* Is there anyway this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically write time of these two rows. the writetime()-function can't operate on primary keys or collections, and that is all I have.
Thanks.
Update:
This is what I've tried, from comments, answers and other sources
* selecting using blobAsBigInt gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling thrift, is possible but reading the table isn't. It's not supported after 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)

I don't expect to see nanoseconds in a timestamp field and additionally i'm of the impression they're fully not supported? Try this:
SELECT house_id, sensor_id, time_bucket, blobAsBigint(sensor_time) FROM test.sensor_data;
I WAS able to replicate it doing by inserting the rows via an integer:
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1,2,4,1451692800001);
This makes sense because I would suspect one of your drivers is using a bigint to insert the timestamp, and one is likely actually using the datetime.
Tried playing with both timezones and bigints to reproduce this... seems like only bigint is reproducable
house_id | sensor_id | time_bucket | sensor_time | sensor_reading
----------+-----------+-------------+--------------------------+----------------
1 | 2 | 3 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-01 23:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 00:00:00+0000 | null
1 | 2 | 4 | 2016-01-02 01:01:00+0000 | null
edit: Tried some shenanigans using bigint in place of datetime insert, managed to reproduce...

Adding some observations on top of what Nick mentioned,
Cassandra Primary key = one or combination of {Partition key(s) + Clustering key(s)}
Keeping in mind the concepts of partition keys used within angular brackets which can be simple (one key) or composite (multiple keys) for unique identification and clustering keys to sort data, the below have been observed.
Query using select: sufficient to query using all the partition key(s) provided, additionally can query using clustering key(s) but in the same order in which they have been mentioned in primary key during table creation.
Update using set or update: the update statement needs to have search/condition clauses which not only include all the partition key(s) but also all the clustering key(s)
Answering the question - Is there anyway this could happen under known circumstances?
Yes, it is possible when same data is inserted from different sources.
To explain further, incase one tries to insert data from code (API etc) into Cassandra and then tries inserting the same data from DataStax Studio/any tool used to perform direct querying, a duplicate record is inserted.
Incase the same data is being pushed multiple times either from code alone or querying tool alone or from another source used to do the same operation multiple times, the data behaves idempotently and is not inserted again.
The possible explanation could be the way the underlying storage engine computes internal indexes or hashes to identify a row pertaining to set of columns (since column based).
Note:
The above information of duplicacy incase same data is pushed from different sources has been observed, tested and validated.
Language used: C#
Framework: .NET Core 3

"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".
However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".
Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.
Update
You can create following experiment (please ignore and fix any syntax errors I will do).
This suppose to do table with composite primary key:
"state" is "Partition Key" and
"city" is "Clustering Column".
create table cities(
state int,
city int,
name text,
primary key((state), city)
);
insert into cities(state, city, name)values(1, 1, 'New York');
insert into cities(state, city, name)values(1, 2, 'Corona');
select * from cities where state = 1;
this will return something like:
1, 1, New York
1, 2, Corona
But on the disk this will be stored on single row like this:
+-------+-----------------+-----------------+
| state | city = 1 | city = 2 |
| +-----------------+-----------------+
| | city | name | city | name |
+-------+------+----------+------+----------+
| 1 | 1 | New York | 2 | Corona |
+-------+------+----------+------+----------+
When you have such composite primary key you can select or delete on it, e.g.
select * from cities where state = 1;
delete from cities where state = 1;
In the question, primary key is defined as:
PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
this means
"house_id", "sensor_id", "time_bucket" is "Partition Key" and
"sensor_time" is the "Clustering Column".
So when you select, the real row is spitted and show as if there are several rows.
Update
http://www.planetcassandra.org/blog/primary-keys-in-cql/
The PRIMARY KEY definition is made up of two parts: the Partition Key
and the Clustering Columns. The first part maps to the storage engine
row key, while the second is used to group columns in a row. In the
storage engine the columns are grouped by prefixing their name with
the value of the clustering columns. This is a standard design pattern
when using the Thrift API. But now CQL takes care of transposing the
clustering column values to and from the non key fields in the table.
Then read the explanations in "The Composite Enchilada".

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string