YugabyteDB Row-Level Geo-Partitioning - yugabytedb

I wrote a row-level geo-partitioning POC following the link below, but I found that every node holds exactly the same data for the table. I want to achieve a horizontal split so each partition's rows live only on their own nodes. How do you do it? Thank you!
https://docs.yugabyte.com/latest/explore/multi-region-deployments/row-level-geo-partitioning/
/usr/local/yugabyte-2.9.0.0/bin/yugabyted start \
--base_dir=/home/yugabyte/yugabyte-data \
--listen=192.168.106.34 \
--master_flags "placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a" \
--tserver_flags "placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a" &
/usr/local/yugabyte-2.9.0.0/bin/yugabyted start \
--base_dir=/home/yugabyte/yugabyte-data \
--listen=192.168.106.23 \
--join=192.168.106.34 \
--tserver_flags "placement_cloud=aws,placement_region=us-east-1,placement_zone=us-east-1a"
CREATE TABLE transactions (
user_id INTEGER NOT NULL,
account_id INTEGER NOT NULL,
geo_partition VARCHAR,
account_type VARCHAR NOT NULL,
amount NUMERIC NOT NULL,
txn_type VARCHAR NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
) PARTITION BY LIST (geo_partition);
CREATE TABLESPACE us_central_1_tablespace WITH (
replica_placement='{"num_replicas": 1, "placement_blocks":
[{"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":1}]}'
);
CREATE TABLESPACE ap_south_1_tablespace WITH (
replica_placement='{"num_replicas": 1, "placement_blocks":
[{"cloud":"cloud1","region":"datacenter1","zone":"rack1","min_num_replicas":1}]}'
);
CREATE TABLE transactions_us
PARTITION OF transactions
(user_id, account_id, geo_partition, account_type,
amount, txn_type, created_at,
PRIMARY KEY (user_id HASH, account_id, geo_partition))
FOR VALUES IN ('US') TABLESPACE us_central_1_tablespace;
CREATE TABLE transactions_default
PARTITION OF transactions
(user_id, account_id, geo_partition, account_type,
amount, txn_type, created_at,
PRIMARY KEY (user_id HASH, account_id, geo_partition))
FOR VALUES IN ('India') TABLESPACE ap_south_1_tablespace;
INSERT INTO transactions
VALUES (200, 20001, 'India', 'savings', 1000, 'credit');
INSERT INTO transactions
VALUES (300, 30001, 'US', 'checking', 105.25, 'debit');
select * from transactions;
select * from transactions_us;
select * from transactions_default;
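As a minimal check of the row routing (a sketch relying on the tableoid system column, which YSQL inherits from its PostgreSQL base), you can list which child partition each inserted row landed in:
SELECT tableoid::regclass AS partition_name, user_id, geo_partition, account_type FROM transactions;
Note also that in the setup above both nodes advertise aws/us-east-1/us-east-1a, while ap_south_1_tablespace asks for cloud1/datacenter1/rack1, so no node in this cluster matches that tablespace's placement block.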

Related

SyntaxException: line 2:10 no viable alternative at input 'UNIQUE' > (...NOT EXISTS books ( id [UUID] UNIQUE...)

I am trying the following codes to create a keyspace and a table inside of it:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy',
'replication_factor': 3 };
CREATE TABLE IF NOT EXISTS books (
id UUID PRIMARY KEY,
user_id TEXT UNIQUE NOT NULL,
scale TEXT NOT NULL,
title TEXT NOT NULL,
description TEXT NOT NULL,
reward map<INT,TEXT> NOT NULL,
image_url TEXT NOT NULL,
video_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
But I do get:
SyntaxException: line 2:10 no viable alternative at input 'UNIQUE'
(...NOT EXISTS books ( id [UUID] UNIQUE...)
What is the problem and how can I fix it?
I see three syntax issues. They are mainly related to CQL != SQL.
The first is that NOT NULL is not valid at column definition time. Cassandra doesn't enforce constraints like that at all, so for this case, just get rid of all of them.
Next, Cassandra CQL does not allow default values, so this won't work:
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
Providing the current timestamp for created_at is something that will need to be done at write time. Fortunately, CQL has a few built-in functions to make this easier:
INSERT INTO books (id, user_id, created_at)
VALUES (uuid(), 'userOne', toTimestamp(now()));
In this case, I've invoked the uuid() function to generate a Type-4 UUID. I've also invoked now() for the current time. However now() returns a TimeUUID (Type-1 UUID) so I've nested it inside of the toTimestamp function to convert it to a TIMESTAMP.
Finally, UNIQUE is not valid.
user_id TEXT UNIQUE NOT NULL,
It looks like you're trying to make sure that duplicate user_ids are not stored with each id. You can help to ensure uniqueness of the data in each partition by adding user_id to the end of the primary key definition as a clustering key:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
...
PRIMARY KEY (id, user_id));
This PK definition will ensure that data for books will be partitioned by id, containing multiple user_id rows.
Not sure what the relationship between books and users is, though. If one book can have many users, then this will work. If one user can have many books, then you'll want to switch the order of the keys to this:
PRIMARY KEY (user_id, id));
In summary, a working table definition for this problem looks like this:
CREATE TABLE IF NOT EXISTS books (
id UUID,
user_id TEXT,
scale TEXT,
title TEXT,
description TEXT,
reward map<INT,TEXT>,
image_url TEXT,
video_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY (id, user_id));
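As a usage sketch against this final table (the column values below are made up for illustration, not taken from the question), the timestamp is supplied at insert time:
INSERT INTO books (id, user_id, scale, title, description, reward, image_url, video_url, created_at)
VALUES (uuid(), 'userOne', 'novel', 'A Title', 'A description', {1: 'reward-one'}, 'http://example.com/img.png', 'http://example.com/vid.mp4', toTimestamp(now()));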

Cassandra : Key Level access in Map type columns

In Cassandra, suppose we need to access a map-type column at the key level. How do we do it?
Create statement:
create table collection_tab2(
empid int,
emploc map<text,text>,
primary key(empid));
Insert statement:
insert into collection_tab2 (empid, emploc ) VALUES ( 100,{'CHE':'Tata Consultancy Services','CBE':'CTS','USA':'Genpact LLC'} );
select:
select emploc from collection_tab2;
empid | emploc
------+--------------------------------------------------------------------------
100 | {'CBE': 'CTS', 'CHE': 'Tata Consultancy Services', 'USA': 'Genpact LLC'}
In that case, if I want to access the 'USA' key alone, what should I do?
I tried using an index, but all the values are returned.
CREATE INDEX fetch_index ON killrvideo.collection_tab2 (keys(emploc));
select * from collection_tab2 where emploc CONTAINS KEY 'CBE';
empid | emploc
------+--------------------------------------------------------------------------
100 | {'CBE': 'CTS', 'CHE': 'Tata Consultancy Services', 'USA': 'Genpact LLC'}
But expected:
'CHE': 'Tata Consultancy Services'
Just as a data model change I would strongly recommend:
create table collection_tab2(
empid int,
emploc_key text,
emploc_value text,
primary key(empid, emploc_key));
Then you can query and page through it simply, since emploc_key is a clustering key instead of part of a CQL collection, which has multiple limits and negative performance impacts.
Then:
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'CHE', 'Tata Consultancy Services');
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'CBE', 'CTS');
insert into collection_tab2 (empid, emploc_key, emploc_value) VALUES ( 100, 'USA', 'Genpact LLC');
You can also put these in an unlogged batch and it will still be applied efficiently and atomically, because all of the rows are in the same partition.
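For example, a minimal sketch of that unlogged batch over the three inserts above:
BEGIN UNLOGGED BATCH
INSERT INTO collection_tab2 (empid, emploc_key, emploc_value) VALUES (100, 'CHE', 'Tata Consultancy Services');
INSERT INTO collection_tab2 (empid, emploc_key, emploc_value) VALUES (100, 'CBE', 'CTS');
INSERT INTO collection_tab2 (empid, emploc_key, emploc_value) VALUES (100, 'USA', 'Genpact LLC');
APPLY BATCH;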
To do it with the model as you have it, you can, in Cassandra 4.0 and later (CASSANDRA-7396), use [] selectors like:
SELECT emploc['USA'] FROM collection_tab2 WHERE empid = 100;
But I would still strongly recommend the data model change, as it's significantly more efficient and works in existing versions with:
SELECT * FROM collection_tab2 WHERE empid = 100 AND emploc_key = 'USA';

Cassandra UDT search in a list collection

I have this structure in Cassandra
CREATE TYPE IF NOT EXISTS json_test.sensor_frame (
id_secret text,
raw text
);
CREATE TABLE IF NOT EXISTS json_test.json_table (
user_id text,
timestamp timestamp,
device_id text,
sensor_key text,
sensor_values list<FROZEN<sensor_frame>>,
PRIMARY KEY ((user_id, device_id), timestamp, sensor_key)
) WITH CLUSTERING ORDER BY (timestamp DESC) AND caching = {'keys': 'ALL', 'rows_per_partition': '1000'};
I would like to search for the value of the field id_secret.
I have been trying with these queries without any success:
select sensor_values from json_test.json_table where sensor_values = { id_secret: '703468940' };
select sensor_values from json_test.json_table where sensor_values LIKE {%id_secret: '703468940'%} allow filtering;
select sensor_values from json_test.json_table where sensor_values CONTAINS {id_secret: '703468940'} allow filtering;
Is it possible to make the query? Should I change the structure for making this kind of queries?
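One possible direction (a sketch of my own, not from the original thread, assuming a data-model change is acceptable): individual fields inside a frozen UDT in a collection can't be filtered directly, so denormalize the frames into a lookup table keyed by id_secret and query that table instead.
CREATE TABLE IF NOT EXISTS json_test.frames_by_secret (
id_secret text,
user_id text,
device_id text,
timestamp timestamp,
raw text,
PRIMARY KEY (id_secret, user_id, device_id, timestamp)
);
-- written alongside json_table at insert time, then:
SELECT * FROM json_test.frames_by_secret WHERE id_secret = '703468940';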

Cassandra CQL alternative to OR in WHERE clause

Here's the code I used to create the table:
CREATE TABLE test.packages (
packageuuid timeuuid,
ruserid text,
suserid text,
timestamp int,
PRIMARY KEY (ruserid, suserid, packageuuid, timestamp)
);
and then I create a materialized view:
CREATE MATERIALIZED VIEW test.packages_by_userid
AS SELECT * FROM test.packages
WHERE ruserid IS NOT NULL
AND suserid IS NOT NULL
AND TIMESTAMP IS NOT NULL
AND packageuuid IS NOT NULL
PRIMARY KEY (ruserid, suserid, timestamp, packageuuid)
WITH CLUSTERING ORDER BY (packageuuid DESC);
I want to be able to search for packages sent between two IDs
so I would need something like this:
SELECT * FROM test.packages_by_userid WHERE (ruserid = '1' AND suserid = '2' AND suserid = '1' AND ruserid = '2') AND timestamp > 1496601553;
How would I accomplish something like this with CQL?
I've searched a bit but I can't figure it out.
I'm willing to change the structure of the table if it will make something like this possible.
If it's doable without a materialized view that would also be good.
Use the IN clause:
SELECT * FROM test.packages_by_userid WHERE ruserid IN ( '1', '2') AND suserid IN ( '1','2') AND timestamp > 1496601553;
Note: keep the IN clause small. A large IN clause across partitions can cause GC pauses and heap pressure that lead to overall slower performance.
In practical terms this means you're waiting on a single coordinator node to give you a response; it keeps all of those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
If the multi-partition IN clause gets large, use a separate query for each partition (ruserid) with executeAsync:
SELECT * FROM test.packages_by_userid WHERE ruserid = '1' AND suserid IN ( '1','2') AND timestamp > 1496601553;
SELECT * FROM test.packages_by_userid WHERE ruserid = '2' AND suserid IN ( '1','2') AND timestamp > 1496601553;
Learn More : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Since you always search for both sender and receiver, I'd model this with the following table layout:
CREATE TABLE test.packages (
ruserid text,
suserid text,
timestamp int,
packageuuid timeuuid,
PRIMARY KEY ((ruserid, suserid), timestamp)
);
In this way, for each pair of sender/receiver you need to run two queries, one for each partition:
SELECT * FROM packages WHERE ruserid = '1' AND suserid = '2' AND timestamp > 1496601553;
SELECT * FROM packages WHERE ruserid = '2' AND suserid = '1' AND timestamp > 1496601553;
This is IMHO the best solution because, remember, in Cassandra you start from your queries and build your table models on that, never the reverse.
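As a variant sketch (my own assumption, not part of the answer above): keeping packageuuid as a trailing clustering column avoids overwriting rows when two packages between the same pair share a timestamp, and a descending clustering order returns the newest packages first.
CREATE TABLE test.packages (
ruserid text,
suserid text,
timestamp int,
packageuuid timeuuid,
PRIMARY KEY ((ruserid, suserid), timestamp, packageuuid)
) WITH CLUSTERING ORDER BY (timestamp DESC, packageuuid DESC);
The same two per-pair queries shown above work unchanged against this layout.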

Cassandra Schema for retrieving date-ordered records

Folks,
I would like to solve the following with one table in Cassandra. The service tracks when users open an asset; on subsequent events for the same asset, we simply overwrite the accessDate.
example record:
{ userId: "string", assetId: "string", accessDate: unixTimestamp }
With this said, we need to fulfill the following access requirements (each requirement has its own bulletpoint for readability):
Be able to return all assets a user has opened, and at what time.
This is easy to achieve, table could look like:
CREATE TABLE user_assets_tracker (
userId uuid,
accessDate timestamp,
assetId uuid,
PRIMARY KEY (userid, accessDate, assetId)
);
This allows us to query for all assets, and when each was last accessed.
SELECT *
FROM user_assets_tracker
WHERE userId = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC;
Dandy. Now for the harder bits, which I am unsure about and was hoping you folks could chime in on:
Show me all the assets user added in the past 30 days.
Naturally the LIMIT here is not what we need. Also, we may need to have 2 tables to achieve this.
SELECT *
FROM user_assets_tracker
WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC
LIMIT 10; ?????
Show me the last accessed item for the user. I think this one is easier, the LIMIT 1 solves that.
This is probably straight forward, with this schema:
CREATE TABLE user_assets_tracker (
userId uuid,
accessDate timestamp,
assetId uuid,
PRIMARY KEY (userid, accessDate, assetId)
);
SELECT *
FROM user_assets_tracker
WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89
ORDER BY accessDate DESC
LIMIT 1;
Retrieve the full record for a particular userId + assetId
Since accessDate comes before assetId in our schema, I am not sure how to do this as well. Another table?
Thanks!!
PS: It seems that a SASI index could be the solution.
Since you are always selecting assetid ordered by accessDate descending, define your schema with CLUSTERING ORDER BY accessdate DESC:
CREATE TABLE user_assets_tracker (
userid uuid,
accessdate timestamp,
assetid uuid,
PRIMARY KEY (userid, accessdate, assetid)
) WITH CLUSTERING ORDER BY (accessdate DESC, assetid ASC);
Now you don't need to specify ORDER BY accessDate DESC every time; by default your data will be ordered by accessDate descending.
Show me all the assets user added in the past 30 days.
First, get the timestamp for 30 days ago.
Let's say the timestamp 30 days ago is 2017-02-05 12:00:00+0000.
Now you can query:
SELECT * FROM user_assets_tracker WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 AND accessdate >= '2017-02-05 12:00:00+0000';
Retrieve the full record for a particular userId + assetId
If you are using Cassandra 3.0 or above, you can use materialized views.
Create a materialized view:
CREATE MATERIALIZED VIEW user_assets AS
SELECT *
FROM user_assets_tracker
WHERE userid IS NOT NULL AND assetid IS NOT NULL AND accessdate IS NOT NULL
PRIMARY KEY (userid, assetid, accessdate);
Now if you want to get all the data for a given userid and assetid, here is the query:
SELECT * FROM user_assets WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 AND assetid = 1d45e6c2-02a1-11e7-aac5-b9ab92bee74c;
One more thing: if a huge amount of data is inserted for a single user, you should add a time bucket to the partition key along with userid. For more, check this answer: https://stackoverflow.com/a/41857183/2320144
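A rough sketch of that bucketing idea (the bucket column name and the monthly granularity are my own assumptions, not prescribed by the linked answer):
CREATE TABLE user_assets_tracker_bucketed (
userid uuid,
month_bucket text,
accessdate timestamp,
assetid uuid,
PRIMARY KEY ((userid, month_bucket), accessdate, assetid)
) WITH CLUSTERING ORDER BY (accessdate DESC, assetid ASC);
SELECT * FROM user_assets_tracker_bucketed WHERE userid = 522b1fe2-2e36-4cef-a667-cd4237d08b89 AND month_bucket = '2017-02';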
