How can one interleave with two tables? - google-cloud-spanner

Let's pretend I have the schema
CREATE TABLE Account (
AccountId BYTES(MAX),
Foo STRING(1024)
) PRIMARY KEY (AccountId);
CREATE TABLE Customer (
CustomerId BYTES(MAX),
Bar STRING(1024)
) PRIMARY KEY (CustomerId);
And I create a new table:
CREATE TABLE Order (
AccountId BYTES(MAX),
CustomerId BYTES(MAX),
Baz STRING(1024)
) PRIMARY KEY (AccountId, CustomerId);
That I'd like to INTERLEAVE with Account and Customer. How can one do this? I'm familiar with how to INTERLEAVE with one table, when building a hierarchy, but not sure how to achieve this with two tables.

You cannot interleave one table in two other tables, but you can create a hierarchy of interleaved tables. In your example, that would mean interleaving the Customer table in the Account table, and the Order table in the Customer table like this:
CREATE TABLE Account (
AccountId BYTES(MAX),
Foo STRING(1024)
) PRIMARY KEY (AccountId);
CREATE TABLE Customer (
AccountId BYTES(MAX),
CustomerId BYTES(MAX),
Bar STRING(1024)
) PRIMARY KEY (AccountId, CustomerId),
INTERLEAVE IN PARENT Account;
CREATE TABLE Order (
AccountId BYTES(MAX),
CustomerId BYTES(MAX),
OrderId BYTES(MAX),
Baz STRING(1024)
) PRIMARY KEY (AccountId, CustomerId, OrderId),
INTERLEAVE IN PARENT Customer;
You cannot interleave one table in two other tables in the way that you ask because interleaving means that Cloud Spanner physically stores the rows of the interleaved child table together with the corresponding rows of the parent table. If a table were interleaved in two different, unrelated parent tables, there would be no way to determine where to store the child rows.
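As a rough illustration of that physical layout, here is a minimal Python sketch. It is a toy model, not Spanner's actual storage format: it assumes all rows of one hierarchy live in a single key-ordered sequence, and the sample keys (`acct1`, `cust1`, ...) are made up.

```python
# Toy model of interleaved storage: rows from every table in one hierarchy
# are kept in a single sequence ordered by primary key.
# The key values below are hypothetical sample data.
rows = [
    (("acct1",), "Account"),
    (("acct2",), "Account"),
    (("acct1", "cust2"), "Customer"),
    (("acct1", "cust1"), "Customer"),
    (("acct1", "cust1", "order1"), "Order"),
    (("acct2", "cust3"), "Customer"),
]

# Python compares tuples element by element, and a shorter tuple sorts before
# any longer tuple that extends it. Each parent row is therefore immediately
# followed by its descendants.
physical_layout = sorted(rows, key=lambda r: r[0])

for key, table in physical_layout:
    indent = "  " * (len(key) - 1)
    print(f"{indent}{table} {key}")
```

Each child key has to start with exactly one parent's key for this ordering to place it, which is why a single chain (Account, then Customer, then Order) works while two unrelated parents do not.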

Related

Not able to run multiple where clause without Cassandra allow filtering

Hi, I am new to Cassandra.
We are working on an IoT project where car sensor data will be stored in Cassandra.
Here is an example of one table in which I am going to store the data from one of the sensors.
This is some sample data.
I want to partition the data by organization_id so that each organization's data is stored in separate partitions.
Here is the create table command:
CREATE TABLE IF NOT EXISTS engine_speed (
id UUID,
engine_speed_rpm text,
position int,
vin_number text,
last_updated timestamp,
organization_id int,
odometer int,
PRIMARY KEY ((id, organization_id), vin_number)
);
This works fine. However, all my queries will look like the one below:
select * from engine_speed
where vin_number='xyz'
and organization_id = 1
and last_updated >='from time stamp' and last_updated <='to timestamp'
Almost all queries on all the tables will have a similar or identical WHERE clause.
I am getting an error that asks me to add ALLOW FILTERING.
Kindly let me know how to partition the table and define the right primary key and indexes so that I don't have to add ALLOW FILTERING to the query.
Apologies for this basic question, but I'm just starting to use Cassandra (Apache Cassandra 3.11.12).
The WHERE clause must restrict the primary key columns in the order in which they are defined in your DDL: every partition key column needs an equality restriction, and you cannot skip a clustering column before restricting the next one. Based on the query pattern you have described, you can try the DDL below:
CREATE TABLE IF NOT EXISTS autonostix360.engine_speed (
vin_number text,
organization_id int,
last_updated timestamp,
id UUID,
engine_speed_rpm text,
position int,
odometer int,
PRIMARY KEY ((vin_number, organization_id), last_updated)
);
But remember,
PRIMARY KEY ((vin_number, organization_id), last_updated)
PRIMARY KEY ((vin_number), organization_id, last_updated)
These two definitions are different in Cassandra. In case 1, your data will be partitioned by the combination of vin_number and organization_id, while last_updated will act as the ordering (clustering) key. In case 2, your data will be partitioned only by vin_number, while organization_id and last_updated will act as ordering keys. You need to figure out which case suits your use case.
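To make the difference concrete, here is a small Python sketch that groups rows the way each key definition would. It is a simplified model rather than Cassandra's implementation; `build_partitions` is a hypothetical helper and the row values are made up.

```python
from collections import defaultdict


def build_partitions(rows, partition_cols, clustering_cols):
    """Group rows by partition key, then sort each group by clustering key."""
    partitions = defaultdict(list)
    for row in rows:
        partition_key = tuple(row[c] for c in partition_cols)
        partitions[partition_key].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: tuple(r[c] for c in clustering_cols))
    return dict(partitions)


# Hypothetical sample rows; last_updated is shown as a plain integer.
rows = [
    {"vin_number": "xyz", "organization_id": 1, "last_updated": 30},
    {"vin_number": "xyz", "organization_id": 2, "last_updated": 10},
    {"vin_number": "xyz", "organization_id": 1, "last_updated": 20},
]

# Case 1: PRIMARY KEY ((vin_number, organization_id), last_updated)
case1 = build_partitions(rows, ["vin_number", "organization_id"], ["last_updated"])

# Case 2: PRIMARY KEY ((vin_number), organization_id, last_updated)
case2 = build_partitions(rows, ["vin_number"], ["organization_id", "last_updated"])

print(len(case1), "partitions in case 1")
print(len(case2), "partition in case 2")
```

In case 1 each (vin_number, organization_id) pair gets its own partition, sorted by last_updated. In case 2 all organizations for a VIN share one partition, sorted by organization_id first and last_updated second.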

Query by Interleaved table fields using Spring Data Spanner

I'm trying to query by a field of an interleaved table using Spring Data Spanner. The id comparison is done automatically by Spring Data Spanner when it performs the ARRAY STRUCT inner join, but I'm not able to add a WHERE clause to the interleaved table query.
Considering the example below:
CREATE TABLE Singers (
Id INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (Id);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
Id INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, Id),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
Let's suppose I want to query all Singers where the AlbumTitle is "Fear of the Dark". How can I write a repository method to achieve that using Spring Data Spanner?
Your example seems either to contain a couple of typos or to be otherwise not completely correct:
The Singers table has a column Id which is the primary key. That is fine in itself, but when creating a hierarchy of interleaved tables, it is recommended to prefix the primary key column with the table name. So it would be better to name it SingerId.
The Albums table has a SingerId column and an Id column, which together form the primary key of the Albums table. This is technically incorrect (and confusing), and it is also the reason I think your example is not completely correct. Because Albums is interleaved in Singers, Albums must contain the same primary key columns as the Singers table, followed by any additional columns that form the primary key of Albums. In this case, Id would refer to the Singers table, and SingerId would be an additional column in the Albums table that has nothing to do with the Singers table. The primary key columns of the parent table must also appear in the same order as in the parent table.
The example data model should therefore be changed to:
CREATE TABLE Singers (
SingerId INT64 NOT NULL,
FirstName STRING(1024),
LastName STRING(1024),
SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);
CREATE TABLE Albums (
SingerId INT64 NOT NULL,
AlbumId INT64 NOT NULL,
AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
From this point on, you can consider the SingerId column in the Albums table as a foreign key relationship to a Singer and treat it as you would in any other database system. Note also that there can be multiple albums for each singer, so the request "I want to query all Singers where the AlbumTitle is 'Fear of the Dark'" is slightly ambiguous. I would rather say:
Give me all singers that have at least one album with the title "Fear of the Dark"
A valid query for that would be:
SELECT *
FROM Singers
WHERE SingerId IN (
SELECT SingerId
FROM Albums
WHERE AlbumTitle='Fear of the Dark'
)
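The behaviour of this kind of IN subquery can be tried out locally with a minimal Python sketch. It uses SQLite as a stand-in, not Cloud Spanner, so the column types are SQLite types and there is no interleaving; the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database used only to illustrate the query shape.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Singers (
        SingerId INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName TEXT
    );
    CREATE TABLE Albums (
        SingerId INTEGER NOT NULL,
        AlbumId INTEGER NOT NULL,
        AlbumTitle TEXT,
        PRIMARY KEY (SingerId, AlbumId)
    );
    -- Hypothetical sample data: singer 1 has two albums with the same title.
    INSERT INTO Singers VALUES (1, 'Alice', 'Smith'), (2, 'Bob', 'Jones');
    INSERT INTO Albums VALUES
        (1, 1, 'Fear of the Dark'),
        (1, 2, 'Fear of the Dark'),
        (2, 1, 'Another Title');
    """
)

matching = conn.execute(
    """
    SELECT SingerId, FirstName, LastName
    FROM Singers
    WHERE SingerId IN (
        SELECT SingerId
        FROM Albums
        WHERE AlbumTitle = ?
    )
    """,
    ("Fear of the Dark",),
).fetchall()

print(matching)
conn.close()
```

Because the subquery only tests membership, a singer with several matching albums still appears once in the result, which matches the "at least one album" wording.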

How to use static column in scylladb and cassandra?

I am new to ScyllaDB and Cassandra, and I am facing some issues querying data from a table. This is the schema I have created:
CREATE TABLE usercontacts (
userID bigint, -- userID
contactID bigint, -- Contact ID lynkApp userID
contactDeviceToken text, -- Device Token
modifyDate timestamp static ,
PRIMARY KEY (contactID,userID)
);
CREATE MATERIALIZED VIEW usercontacts_by_contactid
AS SELECT userID, contactID, contactDeviceToken
FROM usercontacts WHERE
contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- Need to not null as these are the primary keys in main
-- table same structure as the main table
PRIMARY KEY(userID,contactID);
CREATE MATERIALIZED VIEW usercontacts_by_modifyDate
AS SELECT userID,contactID,contactDeviceToken,modifyDate
FROM usercontacts WHERE
contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- Need to not null as these are the primary keys in main
-- table same structure as the main table
PRIMARY KEY (modifyDate,contactID);
I want to create materialized views for the contact table, namely usercontacts_by_userid and usercontacts_by_modifydate.
I need the following queries to work when modifydate (timestamp) is declared static:
update usercontacts set modifydate="newdate" where contactid="contactid"
select * from usercontacts_by_modifydate where modifydate="modifydate"
delete from usercontacts where contactid="contactid"
It is not currently possible to create a materialized view that includes a static column, either as part of the primary key or just as a regular column.
Including a static column would require the whole base table partition (in usercontacts) to be read whenever the static column changes, so that the view rows could be recalculated. This has a significant performance penalty.
Having the static column be the view's partition key means that there would be only one entry in the view for all the rows of a partition. However, secondary indexes do work in this case, and you can use one instead.
This is valid for both Scylla and Cassandra at the moment.

Modeling MultiTenant in Cassandra

I have several customers, each represented by a "tenant".
I would like to know the best way to model this concept. I did a lot of research and found this topic: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Modeling-multi-tenanted-Cassandra-schema-td7591311.html
I know there are several possibilities:
1. One keyspace per tenant
2. One table (column family) per tenant
3. One field representing the tenant in all tables
I chose option 3, but I'm not sure I have the best schema for the best performance.
This is my profile schema
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY(id, tenant)
);
CREATE INDEX ON profiles(datasources);
CREATE INDEX ON profiles(email);
My PARTITION KEY is "id", for uniqueness, and my CLUSTERING KEY is "tenant".
I need to be able to execute these queries as quickly as possible:
SELECT * FROM profiles WHERE id = x
SELECT * FROM profiles WHERE tenant = x
SELECT * FROM profiles WHERE email = x
SELECT * FROM profiles WHERE datasources CONTAINS x
The queries work, but I wondered whether it would be better to have "tenant" as the PARTITION KEY instead of "id", and use "id" as the CLUSTERING KEY:
CREATE TABLE profiles (
...
PRIMARY KEY(tenant, id)
);
In my application "tenant" is always a required field, so running the same queries this way would not be a problem (but would it be faster or slower?):
SELECT * FROM profiles WHERE tenant = y
SELECT * FROM profiles WHERE tenant = y AND id = x
SELECT * FROM profiles WHERE tenant = y AND email = x
SELECT * FROM profiles WHERE tenant = y AND datasources CONTAINS x
Bonus advantage: the ability to sort profiles by creation date (ORDER BY id).
If I understand correctly, with tenant as the PARTITION KEY Cassandra will physically store all rows of the same tenant in the same partition, which can potentially hold up to 2 billion cells. What would happen if one of my customers exceeded that number? I also read that we could use a composite partition key, for example by putting the current date (20150313) in the second part of the key, so that one partition holds only the profiles created on that day for the tenant:
CREATE TABLE profiles (
...
date text,
PRIMARY KEY((tenant, date), id)
);
But with this solution it is not possible to query all the data for a tenant (without a date in the query).
Also, as you can see in my schema, I use secondary indexes for the "email" and "datasources" fields. But I read here http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html that using a secondary index on a huge table to return a small number of results (one in my case) is a bad practice. In my schema "datasources" is a set containing, for example, facebookId, twitterId, etc.
If you have any ideas, I'm really interested :) ! I'm pretty new to Cassandra, so if there are things I have misunderstood, please tell me.
thanks,
Donovan
Data duplication is not a problem in Cassandra, so you should start the data modelling process from your queries.
I'm thinking about something like this:
CREATE TABLE profiles (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((id, tenant))
);
Assuming that the tenant is known at the application level, this model makes the following queries fast:
SELECT * FROM profiles WHERE id = x and tenant = y
CREATE TABLE profiles_emails (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((email, tenant))
);
SELECT * FROM profiles_emails WHERE email = x and tenant = y
CREATE TABLE profiles_tenants (
id timeuuid,
tenant text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant, id))
);
SELECT * FROM profiles_tenants WHERE tenant = x and id = y
CREATE TABLE tenants (
id timeuuid,
tenant text,
date text,
email text,
datasources set<text>,
info map<text, text>,
friends set<timeuuid>,
PRIMARY KEY((tenant, date))
);
SELECT * FROM tenants WHERE tenant = x and date < y
Alternatively, you may look at paging: http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
For "datasources"-based search, you may use a different system such as Elasticsearch or Solr. Or, if the set has a limited number of possible values, you may maintain a separate table for each of them.
Cassandra is fast at write operations and data duplication is not a problem, so you may write to all those tables in a batch.
You also have to take the consistency level into consideration, because it has an impact on read performance. The right choice really depends on your use case.
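The "one table per query" idea can be sketched in plain Python. This is only a toy model under stated assumptions: each dictionary stands in for one of the tables above, `save_profile` is a hypothetical helper playing the role of the batch write, and the sample profile is made up. It does not use a Cassandra driver.

```python
# Each dictionary models one query table, keyed by that table's partition key.
profiles = {}          # keyed by (id, tenant)
profiles_emails = {}   # keyed by (email, tenant)
profiles_tenants = {}  # keyed by (tenant, id)


def save_profile(profile):
    """Write the same profile to every query table, like one logical batch."""
    profiles[(profile["id"], profile["tenant"])] = profile
    profiles_emails[(profile["email"], profile["tenant"])] = profile
    profiles_tenants[(profile["tenant"], profile["id"])] = profile


# Hypothetical sample profile.
save_profile({"id": "id-1", "tenant": "acme", "email": "a@example.com"})

# Every read is a direct lookup by the full key of the matching table.
by_id = profiles[("id-1", "acme")]
by_email = profiles_emails[("a@example.com", "acme")]
print(by_id["email"], by_email["id"])
```

The write path does more work, but every read becomes a single key lookup, which is the trade-off described above.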

Cassandra order by on combination of composite keys

I originally wrote a table that tracks feeds that have been assigned to a user for review.
create table user_feed
(
userid uuid,
languageid uuid,
topicid uuid,
dateinserted timeuuid,
primary key (userid, languageid, topicid, dateinserted)
);
I realized soon after I created this table that I wouldn't be able to sort this table (order by DESC) by dateinserted because for some weird reason, in Cassandra I can only order by the second (and last) column of a composite key table (as in, the table has to have 2 composite keys and order by can only happen on the second column of this key) so I changed my table to this:
create table user_feed
(
userid uuid,
languageid uuid,
topicid uuid,
dateinserted timeuuid,
primary key (userid, dateinserted)
);
and now I was able to run a query to get the latest feeds for the user, using order by.
However, I have a new requirement that requires me to sort the feeds by a combination of (languageid + userid) or (topicid + userid) or (languageid + topicid + userid).
I had an idea to create three new tables and have the keys combined into one key column. For example, for userid + topic query, I would use:
create table user_feed_by_topic
(
usertopicidkey text,
dateinserted timeuuid,
primary key (usertopicidkey, dateinserted)
);
where usertopicidkey = userid.toString() + topicid.toString().
Of course, this solution requires 4 separate inserts whenever I need to insert a new feed row, since I have 4 tables tracking identical data but partitioned differently to allow sorting.
My question is: is there a better way to do this? Is there any way to achieve what I want (query by a combination of columns and order by another column), or am I stuck with my 4-table design?
Many thanks,
Cassandra orders all rows within a partition by the primary key's clustering columns. If your primary key is (userid, languageid, topicid, dateinserted), all rows will be sorted by languageid, then topicid, then dateinserted, in ascending order. This means that rows are sorted by date only within a specific language and topic. You'd have to use the date as the first clustering column to change this behaviour.
It's common practice to denormalize your data across multiple tables to implement different ordering strategies.
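The effect of clustering column order can be seen in a short Python sketch. It is a simplified model of the rows of a single userid partition; `clustered` is a hypothetical helper, the values are made up, and dateinserted is shown as a plain integer.

```python
def clustered(rows, clustering_cols):
    """Return rows in the order implied by the given clustering columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in clustering_cols))


# Hypothetical rows from one userid partition.
rows = [
    {"languageid": "en", "topicid": "t2", "dateinserted": 3},
    {"languageid": "en", "topicid": "t1", "dateinserted": 5},
    {"languageid": "de", "topicid": "t9", "dateinserted": 4},
]

# primary key (userid, languageid, topicid, dateinserted)
original = clustered(rows, ["languageid", "topicid", "dateinserted"])

# Same columns, but with the date as the first clustering column.
date_first = clustered(rows, ["dateinserted", "languageid", "topicid"])

print([r["dateinserted"] for r in original])
print([r["dateinserted"] for r in date_first])
```

With the original key the dates come out in the order language and topic dictate, while putting dateinserted first yields a date-ordered partition.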
