Partitioned table inside a colocated database in YugabyteDB YSQL - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
Is it possible to create a partitioned table in a colocated database?
The database is created with colocated=true, and I'm trying to add a partitioned table like this:
create table test(id bigserial not null, PRIMARY KEY(id HASH)) PARTITION BY RANGE WITH (colocated = false);
I'm getting an error:
Query 1 ERROR: ERROR: syntax error at or near "WITH" LINE 3: PRIMARY KEY(id HASH)) PARTITION BY RANGE WITH (colocated=t…
Is it possible to do this or should I think about some other approach? I’m trying to do geo-partitioning and at the same time have some of the tables colocated.

The syntax is wrong: you need to specify which column(s) to partition by, for example PARTITION BY RANGE (id) (but then why is it a hash primary key?).
Also, a hash-sharded table can't be colocated. In your case the partitioned (parent) table should work once you fix the syntax error, but the partitions under it can't be colocated while they are hash-sharded.
Taking into account the above, you can have something like:
create table new (id bigserial not null, PRIMARY KEY (id ASC)) partition by range(id);
create table new_1 partition of new for values from (5) to (10) with (colocated = false);
create table new_2 partition of new for values from (20) to (30) with (colocated = true);
You can't shard by hash if you want to set colocated=true. It works fine with colocated=false:
create table new (id bigserial not null, value text) partition by range(id);
create table new_1 partition of new (primary key(id hash)) for values from (0) to (5) with (colocated = false);
create table new_2 partition of new (primary key(id hash)) for values from (5) to (10) with (colocated = false);
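For the geo-partitioning part of the question, the usual YSQL pattern is to list-partition on a region column and pin each partition to a region with a tablespace that sets replica_placement. A minimal sketch (the table name, tablespace name, and placement values are illustrative, not from the thread):
CREATE TABLESPACE us_east_ts WITH (
    replica_placement = '{"num_replicas": 1, "placement_blocks": [{"cloud": "aws", "region": "us-east-1", "zone": "us-east-1a", "min_num_replicas": 1}]}'
);

CREATE TABLE orders (
    id bigserial NOT NULL,
    geo text NOT NULL,
    PRIMARY KEY (id ASC, geo ASC)   -- range-sharded key (hash-sharded partitions can't be colocated)
) PARTITION BY LIST (geo);

-- rows with geo = 'us-east-1' are stored only in the us-east placement
CREATE TABLE orders_us_east PARTITION OF orders
    FOR VALUES IN ('us-east-1') TABLESPACE us_east_ts;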

Related

How can I set a Cassandra primary key?

I specified two columns as the unique key, but when one of them differs, it keeps adding records.
The table schema has a compound primary key, i.e. it is composed of a partition key (username) and clustering key (email). This means that each partition has one or more rows of emails.
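For reference, a compound primary key of that shape looks like this (a reconstruction, since the asker's schema isn't shown in the excerpt):
CREATE TABLE users_by_username_and_email (
    username text,
    email text,
    PRIMARY KEY (username, email)   -- partition key: username; clustering key: email
)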
It is a completely different schema to a table with just a simple primary key (only has a partition key, no clustering key) like this:
CREATE TABLE users_by_username (
    username text,
    ...
    PRIMARY KEY (username)
)
This table would only ever have one row in each partition. Cheers!
[UPDATE] If you want your table to be partitioned by BOTH username + email, you need to create a new table which has a composite partition key (partition key has two or more columns):
CREATE TABLE users_by_username_email (
    username text,
    email text,
    ...
    PRIMARY KEY ( (username, email) )
)
Note the difference: BOTH columns are enclosed in parentheses, so they are treated as one key.
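A quick way to see the difference in behaviour (hypothetical values): because CQL inserts are upserts, a table keyed only by username would overwrite on the second statement below, while the composite partition key keeps both rows:
INSERT INTO users_by_username_email (username, email)
VALUES ('alice', 'alice@home.example');
INSERT INTO users_by_username_email (username, email)
VALUES ('alice', 'alice@work.example');
-- two rows in two distinct partitions; neither overwrites the other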

Spark SQL: INSERT statement with JDBC does not support default value

I am trying to read/write data from other databases using JDBC.
I'm just following the doc: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
But I found that Spark SQL does not handle DEFAULT values or AUTO_INCREMENT columns well:
CREATE TEMPORARY VIEW jdbcTable
USING org.apache.spark.sql.jdbc
OPTIONS (
    url "jdbc:postgresql:dbserver",
    dbtable "schema.tablename",
    user 'username',
    password 'password'
)

INSERT INTO TABLE jdbcTable (id) VALUES (1)
Here is my DDL:
CREATE TABLE `tablename` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `age` int(11) NULL DEFAULT 0,
    PRIMARY KEY (`id`) USING BTREE
)
The error:
org.apache.spark.sql.AnalysisException: unknown requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).
Is there any way to support DEFAULT values or AUTO_INCREMENT? thx
I have discovered this same issue with DEFAULT columns and also COMPUTED columns. If you are using SQL Server, you can consider an AFTER INSERT trigger; otherwise you may need to calculate the id on the INSERT side.
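Following that advice, the simplest workaround in plain Spark SQL is to compute and supply a value for every column yourself, since the JDBC writer will not fall back to the database's DEFAULT or AUTO_INCREMENT (the literal values here are illustrative):
-- provide id and age explicitly; Spark inserts exactly the columns it is given
INSERT INTO TABLE jdbcTable VALUES (1, 0)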

How to use a static column in ScyllaDB and Cassandra?

I am new to ScyllaDB and Cassandra and am facing some issues querying data from a table. The following is the schema I have created:
CREATE TABLE usercontacts (
    userID bigint,             -- user ID
    contactID bigint,          -- contact ID (lynkApp user ID)
    contactDeviceToken text,   -- device token
    modifyDate timestamp static,
    PRIMARY KEY (contactID, userID)
);
CREATE MATERIALIZED VIEW usercontacts_by_contactid
AS SELECT userID, contactID, contactDeviceToken
FROM usercontacts
WHERE contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- the IS NOT NULL filters are needed because these are primary key
-- columns in the main table; same structure as the main table
PRIMARY KEY (userID, contactID);
CREATE MATERIALIZED VIEW usercontacts_by_modifyDate
AS SELECT userID, contactID, contactDeviceToken, modifyDate
FROM usercontacts
WHERE contactID IS NOT NULL AND userID IS NOT NULL AND modifyDate IS NOT NULL
-- the IS NOT NULL filters are needed because these are primary key
-- columns in the main table; same structure as the main table
PRIMARY KEY (modifyDate, contactID);
I want to create materialized views for the contacts table: usercontacts_by_userid and usercontacts_by_modifydate.
I need the following queries to work when modifydate (timestamp) is static:
UPDATE usercontacts SET modifydate = 'newdate' WHERE contactid = 'contactid';
SELECT * FROM usercontacts_by_modifydate WHERE modifydate = 'modifydate';
DELETE FROM usercontacts WHERE contactid = 'contactid';
It is not currently possible to create a materialized view that includes a static column, either as part of the primary key or just as a regular column.
Including a static column would require the whole base partition (in usercontacts) to be read whenever the static column changes, so that the view rows could be recalculated. This has a significant performance penalty.
Making the static column the view's partition key would also mean there is only one entry in the view for all the rows of a partition. However, secondary indexes do work in this case, and you can use one instead.
This is valid for both Scylla and Cassandra at the moment.
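A minimal sketch of that secondary-index alternative (the index name and the timestamp literal are illustrative):
CREATE INDEX usercontacts_modifydate_idx ON usercontacts (modifyDate);

-- contacts in partitions whose (static) modifyDate matches
SELECT userID, contactID, contactDeviceToken, modifyDate
FROM usercontacts
WHERE modifyDate = '2018-01-01 00:00:00+0000';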

Cassandra: Select on indexed columns and with IN clause for the PRIMARY KEY are not supported

In Cassandra, I'm using this CQL:
select msg from log where id in ('A', 'B') and filter1 = 'filter'
(where id is the partition key, filter1 has a secondary index, and filter1 cannot be used as a clustering column)
This gives the response:
Select on indexed columns and with IN clause for the PRIMARY KEY are not supported
How can I change CQL to prevent this?
You would need to split that up into separate queries of:
select msg from log where id = 'A' and filter1 = 'filter';
and
select msg from log where id = 'B' and filter1 = 'filter';
Due to the way data is partitioned in Cassandra, CQL has a lot of seemingly arbitrary restrictions (to discourage inefficient queries and also because they are complex to implement).
Over time I think these restrictions will slowly be removed, but for now we have to work around them. For more details on the restrictions, see A deep look at the CQL where clause.
Another option is to build a table specifically for this query (a query table), with filter1 as the partition key and id as a clustering key. That way your query works, and you avoid having a secondary index altogether.
aploetz@cqlsh:stackoverflow> CREATE TABLE log
    (filter1 text,
     id text,
     msg text,
     PRIMARY KEY (filter1, id));
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
    VALUES ('filter','A','message A');
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
    VALUES ('filter','B','message B');
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
    VALUES ('filter','C','message C');
aploetz@cqlsh:stackoverflow> SELECT msg FROM log
    WHERE filter1='filter' AND id IN ('A','B');
 msg
-----------
 message A
 message B

(2 rows)
You would still be using IN, which isn't known to perform well either. But since you would also be specifying a partition key, it might perform better than expected.

Cassandra order by on combination of composite keys

I originally wrote a table that tracks feeds that have been assigned to a user for review.
create table user_feed (
    userid uuid,
    languageid uuid,
    topicid uuid,
    dateinserted timeuuid,
    primary key (userid, languageid, topicid, dateinserted)
);
I realized soon after I created this table that I wouldn't be able to sort it (ORDER BY DESC) by dateinserted, because, for some weird reason, Cassandra only lets me order by the second (and last) column of a composite key (as in, the table has to have a two-column composite key, and ORDER BY can only happen on the second column of that key). So I changed my table to this:
create table user_feed (
    userid uuid,
    languageid uuid,
    topicid uuid,
    dateinserted timeuuid,
    primary key (userid, dateinserted)
);
and now I was able to run a query to get the latest feeds for the user, using order by.
However, I have a new requirement that requires me to sort the feeds by a combination of (languageid + userid) or (topicid + userid) or (languageid + topicid + userid).
I had an idea to create three new tables and have the keys combined into one key column. For example, for userid + topic query, I would use:
create table user_feed_by_topic (
    usertopicidkey text,
    dateinserted timeuuid,
    primary key (usertopicidkey, dateinserted)
);
where usertopicidkey = userid.toString() + topicid.toString().
Of course, this solution requires four separate inserts whenever I need to insert a new feed row, since I have four tables tracking identical data but partitioned differently to allow sorting.
My question is, is there a better way to do this? Is there any way to achieve what I want (query by a combination of columns and order by another column) or am I stuck with my 4 table design approach?
Many thanks,
Cassandra will order all rows based on the PK's clustering columns. If your PK is primary key (userid, languageid, topicid, dateinserted), all rows will be sorted by languageid, topicid, and dateinserted, in ascending order. This implies that rows are only sorted by date within a specific language and topic. You'd have to make the date the first clustering column to change this behaviour.
It's common practice to denormalize your data across multiple tables to implement different ordering strategies.
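If it helps, a variant of the user_feed_by_topic idea keeps userid and topicid as separate columns in a composite partition key instead of concatenating strings, and declares the ordering up front. A sketch using the question's column types (the UUID literals are placeholders):
CREATE TABLE user_feed_by_topic (
    userid uuid,
    topicid uuid,
    dateinserted timeuuid,
    PRIMARY KEY ((userid, topicid), dateinserted)
) WITH CLUSTERING ORDER BY (dateinserted DESC);

-- latest feeds for a given user and topic, newest first
SELECT * FROM user_feed_by_topic
WHERE userid = 11111111-1111-1111-1111-111111111111
  AND topicid = 22222222-2222-2222-2222-222222222222
LIMIT 10;
This still needs one insert per query table, but it avoids stringly-typed keys and gives newest-first reads without an ORDER BY in every query.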
