Cannot colocate hash partitioned table in YugabyteDB - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
I'm trying to run through the demo instructions for row-level geo-partitioning. I've created the parent table and tablespaces, but am getting the following error when attempting to create a partitioned table:
CREATE TABLE transactions_us
PARTITION OF transactions
(user_id, account_id, geo_partition, account_type,
amount, txn_type, created_at,
PRIMARY KEY (user_id HASH, account_id, geo_partition))
FOR VALUES IN ('US') TABLESPACE us_tablespace;
ERROR: Invalid argument: Invalid table definition: Error creating table transactions_us on the master: Cannot colocate hash partitioned table
Are these demo instructions still valid?

Your database is colocated, so tables you create inside it are colocated by default. Hash-partitioned colocated tables are currently not allowed, which is why you got this error.
You have two options:
Make the tables range partitioned (specify ASC/DESC instead of HASH).
Don't use a colocated database.
To try row-level geo-partitioning, do not use a colocated database. If you want colocation together with row-level geo-partitioning, you would have to switch to tablegroups, where support is a work-in-progress feature that you can track here: https://github.com/yugabyte/yugabyte-db/issues/5823

Related

CassandraQL allow filtering

I am creating a table in a Cassandra database, but I am getting an ALLOW FILTERING error:
CREATE TABLE device_check (
device_id int,
checked_at bigint,
is_power boolean,
is_locked boolean,
PRIMARY KEY ((device_id), checked_at)
);
When I run this query:
SELECT * FROM device_check WHERE checked_at > 1234432543;
it gives an ALLOW FILTERING error. I tried removing the brackets from device_id, but it gives the same error. Even when I set only checked_at as the primary key, it still won't work with the > operator. With the = operator it works.
A PRIMARY KEY in Cassandra contains two types of keys:
Partition key
Clustering key
It is expressed as PRIMARY KEY ((partition key), clustering keys).
Cassandra is a distributed database in which a row can live on any node, depending on its partition key. To find data quickly, Cassandra expects the query to supply the partition key so that it can identify the node where the data resides and query only that node. If you don't give the partition key, Cassandra has to search all the nodes, so it rejects the query with an error telling you to add ALLOW FILTERING if you really want to run it that way.
The same reasoning explains why > is not supported on the partition key: a range search on it forces Cassandra to scan all the nodes to respond, which is not the right way to use Cassandra.
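The routing argument above can be illustrated with a toy model in Python. This is only a sketch: the three-node cluster, the `node_for` helper, and the MD5-based hash are assumptions made for illustration and are not Cassandra's actual partitioner or API.

```python
import hashlib

NUM_NODES = 3  # hypothetical cluster size

def node_for(partition_key):
    """Map a partition key to a node index (toy stand-in for a partitioner)."""
    digest = hashlib.md5(str(partition_key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Each node holds rows as (device_id, checked_at) tuples.
nodes = [[] for _ in range(NUM_NODES)]
for device_id in range(10):
    for checked_at in (100, 200, 300):
        nodes[node_for(device_id)].append((device_id, checked_at))

def query_with_key(device_id, min_checked_at):
    """Partition key known: only the owning node has to be contacted."""
    target = node_for(device_id)
    rows = [r for r in nodes[target]
            if r[0] == device_id and r[1] > min_checked_at]
    return rows, 1  # matching rows, number of nodes contacted

def query_without_key(min_checked_at):
    """No partition key: every node has to be scanned."""
    rows = [r for node in nodes for r in node if r[1] > min_checked_at]
    return rows, NUM_NODES

rows, contacted = query_with_key(4, 150)
print(sorted(rows), "nodes contacted:", contacted)
rows_all, contacted_all = query_without_key(150)
print(len(rows_all), "rows, nodes contacted:", contacted_all)
```

In this model, the query that names a device touches a single node, while the query that filters only on checked_at has to visit all of them, which is the behaviour the ALLOW FILTERING error warns about.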

Table layout for social app in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I was trying to see if we can avoid data de-normalization using YB's secondary indexes. The primary table is something like this:
CREATE TABLE posts_by_user(
user_id bigint,
post_id bigserial,
group_ids bigint[] null,
tag_ids bigint[] null,
content text null,
....
PRIMARY KEY (user_id, post_id)
)
-- there could be multiple group ids(up to 20) which user can select to publish his/her post in
-- there could be multiple tag ids(up to 20) which user can select to publish his/her post with
This structure makes fetching by user_id easy, but suppose I want to fetch by group_id(s) or tag_id(s). One option is to de-normalize the data into secondary tables using a YB transaction, which requires additional app logic and could also lead to performance issues, because data will be written to multiple nodes based on the hash primary keys (group_ids and tag_ids).
The other option is to use a secondary index to avoid writing additional logic. I have the following doubts about that:
YB stable version 2.8 does not allow creating a secondary index on array columns using GIN. Is there a rough timeline for when it will be available in a stable release?
Will this suffer from the same performance issue, since multiple index entries will be updated on multiple nodes at the time of the client call, based on the partition key group_id(s) or tag_id(s)?
Other ideas are also most welcome for storing the data to enable faster queries by user_id(s), group_id(s), and tag_id(s) in a scalable way.
The problem with the GIN index is that it won't be sorted on disk by timestamp.
For users, you have to create an index on (user_id, datetime desc).
For groups, you can maintain a separate table with a primary key of (group_id desc, datetime desc, post_id desc). The same goes for tags.
On each feed request, you can then run multiple queries for, say, 5 posts for each user_id or group_id and merge the results in the application layer.
This will be the most efficient approach, since all records are sorted on disk and in memory at write time.
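The application-layer merge described above could look roughly like the following Python sketch. It assumes each per-user, per-group, or per-tag query returns rows already sorted newest-first as `(created_at, post_id)` tuples; the function name `merge_feeds` and the row shape are hypothetical and not part of any YugabyteDB API.

```python
import heapq

def merge_feeds(feeds, limit):
    """Merge result lists that are each sorted newest-first.

    Every row is a (created_at, post_id) tuple. A post that shows up in
    several feeds (for example under two groups) is returned only once.
    """
    result = []
    seen_post_ids = set()
    # heapq.merge streams the inputs lazily; reverse=True expects each
    # input to be sorted from largest to smallest.
    for row in heapq.merge(*feeds, reverse=True):
        post_id = row[1]
        if post_id in seen_post_ids:
            continue
        seen_post_ids.add(post_id)
        result.append(row)
        if len(result) == limit:
            break
    return result

# Hypothetical results of three separate queries.
user_feed = [(50, 7), (30, 3), (10, 1)]
group_feed = [(50, 7), (40, 5), (20, 2)]
tag_feed = [(45, 6), (30, 3)]

print(merge_feeds([user_feed, group_feed, tag_feed], limit=5))
```

Because every input is already sorted by the database, the merge only does work proportional to the number of rows it actually returns.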

How to model data using Cassandra and Ignite together?

I'm researching how to model data when using Cassandra and Ignite together. So far the basic recommendation for data modeling in Cassandra (coming from this article) is clear: "model data around your queries". The author gives a "user lookup" example. We want to look up users by their username or their email, and according to him the best approach would be to have two tables:
CREATE TABLE users_by_username (
username text PRIMARY KEY,
email text,
age int
);
CREATE TABLE users_by_email (
email text PRIMARY KEY,
username text,
age int
);
However, things get confusing with Ignite on top of Cassandra. Unfortunately, I could not find any helpful examples or answers to the following questions:
Does having multiple tables that store user information mean having an Ignite cache for each of these tables?
Does having a compound primary key mean introducing a new type for each key and using it as the Ignite cache key?
Having Ignite means not having direct reads from Cassandra. Does it even make sense to bother modeling data following NoSQL best practices? Would it be OK to just have one user table and let Ignite take care of queries by username or email?
CREATE TABLE users (
id uuid PRIMARY KEY,
username text,
email text,
age int
);
You should probably have one cache per Cassandra table.
If your original key is compound, the Ignite key should be compound too.
You will need to use secondary indexes in Ignite to query by more than one field, which means you will have to hold all the data in Ignite (this is NOT necessary for a pure caching scenario). This means enabling readThrough and writeThrough, calling loadCache, and always doing all updates through Ignite. You will have to choose between "Ignite as a cache for Cassandra" (stick to Cassandra's data layout, can hold partial data) and "Ignite as a DB backed by Cassandra" (you can use the layout that is optimal for Ignite, with secondary indexes).

Cassandra query collection while specifying a partition key

I've been reading about indexes in Cassandra but I'm a little confused when it comes to creating an index on a collection like a set, list or map.
Let's say I have the following table, with an index on users:
CREATE TABLE chatter.channels (
id text PRIMARY KEY,
users set<text>
);
CREATE INDEX channels_users_idx ON chatter.channels (values(users));
INSERT INTO chatter.channels (id, users) VALUES ('ch1', {'jeff', 'jenny'});
The docs, at least the ones I've found so far, say that this can cause a huge performance hit because the indexes are created locally on each node. All the examples given query the table like this:
SELECT * FROM chatter.channels WHERE users CONTAINS 'jeff';
From my understanding, this would have the performance hit because the partition key is not specified and all nodes must be queried. However, if I were to issue a query like the one below
SELECT * FROM chatter.channels WHERE id = 'ch1' AND users CONTAINS 'jeff';
(giving the partition key), would I still have the performance hit?
How would I be able to check this for myself? In SQL I can run EXPLAIN and get some useful information. Is there something similar in Cassandra?
Cassandra provides a tracing capability that helps you trace the progression of reads and writes for a query.
To view traces, open cqlsh on one of your Cassandra nodes and run the following commands:
cqlsh> tracing on;
Now tracing requests.
cqlsh> use [KEYSPACE];
Then run your query; cqlsh prints the trace after the results. I hope this helps in checking the performance of the query.

Cassandra: selecting first entry for each value of an indexed column

I have a table of events and would like to extract the first timestamp (column unixtime) for each user.
Is there a way to do this with a single Cassandra query?
The schema is the following:
CREATE TABLE events (
id VARCHAR,
unixtime bigint,
u bigint,
type VARCHAR,
payload map<text, text>,
PRIMARY KEY(id)
);
CREATE INDEX events_u
ON events (u);
CREATE INDEX events_unixtime
ON events (unixtime);
CREATE INDEX events_type
ON events (type);
According to your schema, each user will have a single timestamp. If you want one event per entry, consider:
PRIMARY KEY (id, unixtime).
Assuming that is your schema, the entries for a user will be stored in ascending unixtime order. Be careful, though: if it's an unbounded event stream and users have lots of events, the partition for the id will grow and grow. It's recommended to keep partition sizes to tens or hundreds of megabytes. If you anticipate larger partitions, you'll need to introduce some form of bucketing.
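One common form of bucketing is to add a time bucket to the partition key, so that a single id is spread over many bounded partitions. The Python sketch below shows how an application might derive such a key. It assumes unixtime is in seconds and uses one bucket per UTC calendar month; both the granularity and the helper name `bucketed_partition_key` are illustrative assumptions, not something prescribed by Cassandra.

```python
from datetime import datetime, timezone

def bucketed_partition_key(event_id, unixtime):
    """Return (id, bucket), where bucket is the event's UTC month as 'YYYY-MM'.

    Assumes unixtime is expressed in seconds since the epoch.
    """
    ts = datetime.fromtimestamp(unixtime, tz=timezone.utc)
    return (event_id, ts.strftime("%Y-%m"))

# Two events for the same id, one second apart across a month boundary,
# land in different partitions.
print(bucketed_partition_key("user-1", 1614556799))
print(bucketed_partition_key("user-1", 1614556800))
```

With a key like this, the table's primary key would become something along the lines of ((id, bucket), unixtime), and readers have to know which buckets to query for a given time range.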
Now, on to your query. In a word, no. If you don't hit a partition (by specifying the partition key), your query becomes a cluster-wide operation. With little data it will work, but with lots of data you'll get timeouts. If you do have the data in its current form, then I recommend you use the Cassandra Spark connector and Apache Spark to run your query. An added benefit of the Spark connector is that if your Cassandra nodes also act as Spark worker nodes, then thanks to data locality you can efficiently hit a secondary index without specifying the partition key (which would normally cause a cluster-wide query with timeout issues, etc.). You could even use Spark to compute the required data and store it in another Cassandra table for fast querying.
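Whatever engine performs the full scan, the aggregation itself is a simple "minimum per key" reduction. The plain-Python sketch below shows that logic on in-memory `(u, unixtime)` pairs. It is not Spark or connector code; the function name and the sample rows are hypothetical.

```python
def first_event_per_user(events):
    """Return {u: earliest unixtime} from an iterable of (u, unixtime) pairs."""
    first = {}
    for u, unixtime in events:
        # Keep the smallest timestamp seen so far for each user.
        if u not in first or unixtime < first[u]:
            first[u] = unixtime
    return first

# Hypothetical rows as they might come back from a scan of the events table.
sample = [(1, 1500), (2, 900), (1, 1200), (2, 950), (3, 400)]
print(first_event_per_user(sample))
```

The resulting mapping is the kind of data you could write to a separate table keyed by u for fast lookups.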

Resources