Does INTERLEAVE support the transitive property? - google-cloud-spanner

Let's say I have three tables, Foo, Bar, and Baz:
CREATE TABLE Foo (
    FooId BYTES(MAX)
) PRIMARY KEY (FooId);
CREATE TABLE Bar (
    FooId BYTES(MAX),
    BarId BYTES(MAX)
) PRIMARY KEY (FooId, BarId),
  INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
CREATE TABLE Baz (
    FooId BYTES(MAX),
    BazId BYTES(MAX)
) PRIMARY KEY (FooId, BazId),
  INTERLEAVE IN PARENT Foo ON DELETE CASCADE;
This diagram demonstrates the INTERLEAVE hierarchy:
+-----+   +-----+
|     |   |     |
| Foo <---+ Bar |
|     |   |     |
+--^--+   +-----+
   |
+--+--+
|     |
| Baz |
|     |
+-----+
Cloud Spanner's documentation describes INTERLEAVE as:
You can define hierarchies of parent-child relationships between tables up to seven layers deep, which means you can co-locate rows of seven logically independent tables.
Since we used INTERLEAVE between Bar and Foo, I believe we're guaranteed that they are co-located, with the same keys. Likewise, since we INTERLEAVE Baz in Foo, they should be co-located as well, with the same keys.
If that's true, then are we guaranteed that Bar and Baz are co-located, with the same keys, as well?

In the scenario described, the Bar and Baz rows for a given FooId key are co-located (that is, stored in the same split). Within a split, the rows from each table are grouped together.
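As a rough illustration of the answer above, the following Python sketch is a toy model only, not Spanner's actual storage code. It assumes that co-location is decided by the shared leading key column (FooId); the function name and the sample keys are made up for the example.

```python
from collections import defaultdict


def group_by_root_key(rows):
    """Group (table, primary_key) pairs by the first key column (FooId).

    Toy model: rows that share the root table's key end up in one group,
    which stands in for "stored together" in this sketch.
    """
    groups = defaultdict(list)
    for table, key in rows:
        groups[key[0]].append((table, key))
    return dict(groups)


# Hypothetical sample rows: (table name, primary key tuple).
rows = [
    ("Foo", (b"f1",)),
    ("Bar", (b"f1", b"bar1")),
    ("Baz", (b"f1", b"baz1")),
    ("Foo", (b"f2",)),
    ("Baz", (b"f2", b"baz9")),
]

groups = group_by_root_key(rows)
tables_under_f1 = {table for table, _ in groups[b"f1"]}
```

In this model, Bar and Baz rows land in the same group only because both carry the same FooId prefix, not because of any direct relationship between Bar and Baz.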

Related

AWS Athena (Presto) how to transpose map to columns

AWS Athena query question:
I have a nested map in my rows, of which I would like to transpose the keys to columns.
I could name the columns explicitly like items['label_a'], but in this case the keys are actually dynamic...
From these rows:
{id=1, items={label_a=foo, label_b=foo}}
{id=2, items={label_a=bar, label_c=bar}}
{id=3, items={label_b=baz, label_c=baz}}
I would like to get a table like so:
| id | label_a | label_b | label_c |
------------------------------------
| 1 | foo | foo | |
| 2 | bar | | bar |
| 3 | | baz | baz |
Is that possible, and if so, how can I do it in AWS Athena (Presto version 0.172)?
Thanks!
This is not possible in a dynamic manner, because the output columns need to be known to the planner before query execution starts.
See the previous discussion here: https://github.com/prestosql/presto/issues/2448 and https://github.com/prestosql/presto/issues/1206.
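Since the planner needs the columns up front, one possible workaround is to run two queries: first discover the distinct map keys, then generate a second query that names each key explicitly. The Python sketch below only builds the SQL strings. The table name `my_table` and both helper functions are hypothetical, and it assumes the Presto functions `map_keys` and `element_at` are available in your Athena engine version.

```python
def build_key_discovery_query(table):
    """SQL that lists every distinct key appearing in the `items` map."""
    return (
        f"SELECT DISTINCT k FROM {table} "
        "CROSS JOIN UNNEST(map_keys(items)) AS t(k) ORDER BY k"
    )


def build_pivot_query(table, keys):
    """SQL with one explicit output column per previously discovered key."""
    columns = ["id"]
    for key in keys:
        literal = key.replace("'", "''")   # escape for a SQL string literal
        alias = key.replace('"', '""')     # escape for a quoted identifier
        columns.append(f"element_at(items, '{literal}') AS \"{alias}\"")
    return f"SELECT {', '.join(columns)} FROM {table}"


# Step 1 would run this query and collect the returned keys.
discovery_sql = build_key_discovery_query("my_table")

# Step 2 uses the keys from step 1 (hard-coded here for illustration).
pivot_sql = build_pivot_query("my_table", ["label_a", "label_b", "label_c"])
```

The column list is still fixed at planning time for each query; the dynamic part moves into the client code that generates the second query.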

How to retrieve data from cassandra using IN Query with given order?

I'm selecting data from a Cassandra database using a query. It works fine, but how can I get the data back in the same order as the values I gave in the IN clause?
I have created a table like this:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
I am trying to select data using
SELECT *
FROM malleshdmy
WHERE id IN ( 11,10,5)
But it returns the data in the same order as it is stored:
id | n | p | q
----+---+---+------
5 | 1 | 2 | 4
10 | 2 | 4 | 3
11 | 1 | 2 | null
Please help me with this issue.
I want the data in the order 11, 10, 5.
If id is the partition key, then it's impossible: data is sorted only by the clustering columns within a partition, and rows for different partition keys can be returned in arbitrary order (although rows are still sorted inside each partition).
You need to sort data yourself.
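A minimal sketch of sorting on the client side, assuming the rows have already been fetched and are available as dictionaries (the function name and row representation are illustrative, not part of any Cassandra driver API):

```python
def order_by_requested_ids(rows, requested_ids):
    """Return rows sorted to match the order of ids used in the IN clause."""
    # Map each requested id to its position in the IN list.
    position = {row_id: index for index, row_id in enumerate(requested_ids)}
    return sorted(rows, key=lambda row: position[row["id"]])


# Rows in the order Cassandra returned them in the question.
rows = [
    {"id": 5, "n": 1, "p": 2, "q": 4},
    {"id": 10, "n": 2, "p": 4, "q": 3},
    {"id": 11, "n": 1, "p": 2, "q": None},
]

ordered = order_by_requested_ids(rows, [11, 10, 5])
ordered_ids = [row["id"] for row in ordered]
```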
Since id is your partition key, your data is actually being sorted by the token of id, not the values themselves:
cqlsh:testid> SELECT id,n,p,q,token(id) FROM table;
id | n | p | q | system.token(id)
----+---+---+------+----------------------
5 | 1 | 2 | 4 | -7509452495886106294
10 | 2 | 4 | 3 | -6715243485458697746
11 | 1 | 2 | null | -4156302194539278891
Because of this, you don't have any control over how the partition key is sorted.
To sort your data by id, you need to make id a clustering column rather than the partition key. Your data will still need a partition key, however, and partitions are always ordered by token.
If you decide to make id a clustering column, you can request descending order in the table's CLUSTERING ORDER BY clause:
CREATE TABLE clusterTable (
    partition type, // partition key, with a type to be specified
    id INT,
    n INT,
    p INT,
    q INT,
    PRIMARY KEY ((partition), id)
) WITH CLUSTERING ORDER BY (id DESC);
This link is very helpful in discussing how ordering works in Cassandra: https://www.datastax.com/dev/blog/we-shall-have-order

Get first and last item without using two joins

Currently I have two datasets: a parent and a child. The child dataset contains a "parentId" column that links to the parent table. The child dataset holds data about a person's actions, and the parent table holds data about the person. I want to produce a dataset that contains each person's info together with their first and last action.
The datasets look like this:
Parent:
id | name | gender
111| Alex | Male
222| Alice| Female
Child:
parentId | time | Action
111 | 12:01| Walk
111 | 12:03| Run
222 | 12:04| Walk
111 | 12:05| Jump
111 | 12:06| Run
The dataset I want to produce is:
id | name | gender | firstAction | lastAction |
111| Alex | Male | Walk | Run |
222| Alice| Female | Walk | Walk |
Currently I can achieve this using two window specifications, something like:
WindowSpec w1 = Window.partitionBy("parentId").orderBy(col("time").asc());
WindowSpec w2 = Window.partitionBy("parentId").orderBy(col("time").desc());
and then apply them to the child table using row_number().over(), like:
child.withColumn("rank1", row_number().over(w1))
     .withColumn("rank2", row_number().over(w2));
The issue is that later, when I join with the parent table, I need to join twice: once on parentId=id && rank1=1, and once on parentId=id && rank2=1.
I wonder if there is a way to join only once, which would be much more efficient.
Or am I using the window functions incorrectly, and is there a better way to do this?
Thanks
You could join first and then use groupBy instead of window functions. Something like this could work (not tested, as no programmatic DataFrame was provided):
parent
  .join(child, $"parentId" === $"id")
  .groupBy($"parentId", $"name", $"gender")
  .agg(
    // structs compare field by field, so min/max select the rows with the earliest/latest time
    min(struct($"time", $"action")).as("firstAction"),
    max(struct($"time", $"action")).as("lastAction")
  )
  .select(
    $"parentId",
    $"name",
    $"gender",
    $"firstAction.action".as("firstAction"),
    $"lastAction.action".as("lastAction")
  )
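The min/max-over-struct idea can be checked outside Spark. The plain Python sketch below is only an analogy, assuming that tuples behave like Spark structs in comparing field by field; the function name is made up for illustration and the data comes from the question.

```python
from collections import defaultdict


def first_and_last_actions(child_rows):
    """Map each parentId to (first action, last action) ordered by time."""
    by_parent = defaultdict(list)
    for parent_id, time, action in child_rows:
        # Put time first so tuple comparison orders by time.
        by_parent[parent_id].append((time, action))
    return {
        parent_id: (min(pairs)[1], max(pairs)[1])
        for parent_id, pairs in by_parent.items()
    }


child_rows = [
    (111, "12:01", "Walk"),
    (111, "12:03", "Run"),
    (222, "12:04", "Walk"),
    (111, "12:05", "Jump"),
    (111, "12:06", "Run"),
]

result = first_and_last_actions(child_rows)
```

A single pass over the grouped data yields both values, which is why the approach needs only one join.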

Cassandra: low cardinality partition

Let's say I have a table, something like this:
CREATE TABLE Users (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((user), seq)
);
This follows the desired pattern of Cassandra, with good distribution across partitions (assuming the default Murmur3 hash partitioner).
However, I also need to (rarely) perform range queries on time, in time order. This doesn't seem possible in Cassandra. In reality I do need to access the data by group, so (group, time) is acceptable. Since there doesn't seem to be a way to have a secondary index cover multiple columns, I guess the right thing is to denormalize into something like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time BIGINT,
PRIMARY KEY ((group), time)
) WITH CLUSTERING ORDER BY (time ASC);
This works entirely as it should, except that group has really low cardinality, let's say ('A','B','C'), and is unevenly distributed across users. Since queries on that table are rare, I'm not worried about hot nodes, but I am worried about uneven distribution, perhaps even a single node getting all the data.
Is this a common scenario and is there any way to mitigate this or are there alternative solutions?
One technique that helps avoid hot spots in Cassandra time series models is to use a "time bucket." Essentially, you determine the "happy medium" level of time precision that provides adequate data distribution, while also being known and semi-convenient to query by.
For the purposes of this example, I'll choose year and month ("yyyyMM"). Note: I have no idea if year and month will work for you...it's just an example. Once you determine your time bucket, you would add it as an additional partition key, like this:
CREATE TABLE UsersByGroupTime (
user UUID,
seq INT,
group TEXT,
time TIMEUUID,
yearmonth BIGINT,
PRIMARY KEY ((group, yearmonth), time)
) WITH CLUSTERING ORDER BY (time DESC);
After inserting some rows, queries like this will work:
aploetz@cqlsh:stackoverflow2> SELECT group, yearmonth, dateof(time), time, seq, user
FROM usersbygrouptime WHERE group='B' AND yearmonth=201505;
group | yearmonth | dateof(time) | time | seq | user
-------+-----------+--------------------------+--------------------------------------+-----+--------------------------------------
B | 201505 | 2015-05-16 10:04:10-0500 | ceda56f0-fbdc-11e4-bd43-21b264d4c94d | 1 | d57ba8a4-db24-440c-a983-b1dd6b0d2e27
B | 201505 | 2015-05-16 10:04:09-0500 | ce1cac40-fbdc-11e4-bd43-21b264d4c94d | 1 | 66d07cbb-a2ff-4d56-8fa1-14dfaf684474
B | 201505 | 2015-05-16 10:04:08-0500 | cd525760-fbdc-11e4-bd43-21b264d4c94d | 1 | 07b589ac-4d5f-401e-a34f-e3479e269e01
B | 201505 | 2015-05-16 10:04:06-0500 | cc76c470-fbdc-11e4-bd43-21b264d4c94d | 1 | 984f85b5-ea58-4cf8-b512-43abacb227c9
(4 rows)
Now that may or may not help you query-wise, so you will need to spend some time ensuring that you pick an appropriate time bucket. But, this does help in terms of data distribution in the ring, which you can see with the token function:
aploetz@cqlsh:stackoverflow2> SELECT group, yearmonth, token(group,yearmonth)
FROM usersbygrouptime ;
group | yearmonth | token(group, yearmonth)
-------+-----------+-------------------------
A | 201503 | -3784784210711042553
A | 201504 | -610775546464185720
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
B | 201505 | 6232834565276653514
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
A | 201505 | 8281745497436252453
(12 rows)
Notice how different tokens are generated for each group/yearmonth pair, even though some of them have the same group ("A").
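On the application side, the bucket value has to be computed at write time and enumerated at read time. A minimal sketch, assuming the "yyyyMM" bucket from this example stored as an integer (both helper names are hypothetical):

```python
from datetime import datetime


def year_month_bucket(ts):
    """Return the yyyyMM bucket for a timestamp, e.g. 201505."""
    return ts.year * 100 + ts.month


def buckets_between(start, end):
    """List every yyyyMM bucket from start to end, inclusive.

    A range query spanning several months needs one query per bucket,
    because the bucket is part of the partition key.
    """
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


bucket = year_month_bucket(datetime(2015, 5, 16, 10, 4, 10))
span = buckets_between(datetime(2014, 11, 1), datetime(2015, 2, 1))
```

The trade-off is that a finer bucket spreads data more evenly but makes a wide time range cost more queries.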

Row Inserts having same primary key, are replacing previous writes in Cassandra

I created a table in Cassandra whose primary key is based on two columns (groupname, type). When I insert more than one row with the same groupname and type, only one row is stored: the latest write replaces the earlier writes that have the same groupname and type. Why does Cassandra replace rows in this manner instead of storing every row I insert?
Write 1
cqlsh:resto> insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','SportsDen','VK Group','Majestic','Bangalore','India');
Write 2
insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','Sports Spot','VK Group','Bandra','Mumbai','India');
Write 3
cqlsh:resto> insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','Cricket Heaven ','VK Group','Connaught Place','New Delhi','India');
The result I'm expecting (check rows 4, 5, 6):
groupname | type | rname
----------------+------------+-----------------
none | Udipi | Gayatri Bhavan
none | dinein | Blue Diamond
VK Group | FoodCourt | FoodLion
VK Group | SportsBar | Sports Den
VK Group | SportsBar | Sports Spot
VK Group | SportsBar | Cricket Heaven
Viceroy Group | Vegetarian | Palace Heights
Mainland Group | Chinese | MainLand China
JSP Group | FoodCourt | Nautanki
Ohris | FoodCourt | Ohris
But this is the actual result (write 3 has replaced previous 2 inserts [rows 4,5])
cqlsh:resto> select groupname,type,rname From restmaster;
groupname | type | rname
----------------+------------+-----------------
none | Udipi | Gayatri Bhavan
none | dinein | Blue Diamond
VK Group | FoodCourt | FoodLion
VK Group | SportsBar | Cricket Heaven
Viceroy Group | Vegetarian | Palace Heights
Mainland Group | Chinese | MainLand China
JSP Group | FoodCourt | Nautanki
Ohris | FoodCourt | Ohris
cqlsh:resto> describe table restmaster;
CREATE TABLE restmaster (
groupname text,
type text,
address text,
city text,
country text,
rest_id uuid,
rname text,
PRIMARY KEY ((groupname), type)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
All inserts to a Cassandra database are actually insert/update operations (upserts), and there can only be one set of non-key values per uniquely defined primary key. This means that you can never have more than one set of values for one primary key, and you will only see the last write.
More info:
http://www.datastax.com/documentation/cql/3.1/cql/cql_intro_c.html
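The effect can be modeled with a dictionary keyed by the primary key columns. This Python sketch is a simplified model of upsert semantics, not Cassandra code; the function name is made up, and the rows are abbreviated from the question.

```python
def apply_inserts(inserts, key_columns):
    """Apply inserts as upserts: rows with the same key overwrite each other."""
    table = {}
    for row in inserts:
        key = tuple(row[column] for column in key_columns)
        # An insert on an existing key updates that row instead of adding one.
        table.setdefault(key, {}).update(row)
    return list(table.values())


inserts = [
    {"groupname": "VK Group", "type": "SportsBar", "rname": "SportsDen"},
    {"groupname": "VK Group", "type": "SportsBar", "rname": "Sports Spot"},
    {"groupname": "VK Group", "type": "SportsBar", "rname": "Cricket Heaven"},
]

# Key (groupname, type): all three writes hit the same key.
two_column_key = apply_inserts(inserts, ("groupname", "type"))

# Key (groupname, type, rname): each write has its own key.
three_column_key = apply_inserts(inserts, ("groupname", "type", "rname"))
```

With the two-column key only the last write survives, which matches the behavior in the question; adding a distinguishing column to the key keeps all three rows.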
Update: a data model
If you used a key like
PRIMARY KEY ((groupname), type, rname)
then, as long as you have unique restaurant names, you will be able to get the results you are expecting. But what you really should be asking is "What queries would I like to perform on this data?" All Cassandra tables should be designed around satisfying a class of queries. The key I wrote above basically says "This table is constructed to quickly look up all the restaurants in a particular group, and the only conditionals I will use will be on type and on restaurant name."
Example queries you could perform with that schema:
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' ;
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' and type = 'Formal' ;
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' and type = 'Formal'
and rname > 'C' and rname < 'Y' ;
If those aren't the kinds of queries you want to perform in your application, or you want other queries in addition to those, you will most likely need additional tables.

Resources